
1 Databases in Bioinformatics (Roald Forsberg)

2 Overview
The role of databases in bioinformatics
The structure of databases
–Relational databases
–Database Management Systems
–Accessing databases
Types of databases
–Data types
–Integrated databases (Entrez)
Nucleotide sequence formats
–FASTA format
–GenBank format
–XML formats

3 Databases in Bioinformatics
Bioinformatics, attempted definition: “The application of computational techniques to understand and organise the information associated with biological macromolecules”
Adapted from Oxford English Dictionary
[Diagram labels: Biological experiments, Databases, Computational Biology]

4 Ask your neighbour
What would you like to do with a database?
Which types of biological information could be stored in a database?

5 Use of databases
Homology searching:
–Use of knowledge from other, often better-described organisms, such as the model organisms mouse, Drosophila, Fugu, C. elegans, etc.
–Sequence level: position, annotation
–Structural level: proteins, RNA
Evolutionary analyses:
–Phylogenetics
–Population genetics
–Molecular evolution of genetic elements
–Genome evolution
Primer design
Microarray design
Drug design
Many more...

6 General types of databases
Primary
–Raw, unprocessed data
Secondary
–Curated: data selected according to criteria
–E.g. non-redundancy, fold
Tertiary
–Processed data
–E.g. HMM profiles

7 Structure of relational databases
[Diagram: entries stored in Table 1 and Table 2]
Entries: MEQ147631 through MEQ147641
Table = genetic element
Field = position = chr. 4
Field = size = 3540 bp
Field = coding = true
Field = known EST = true
Field = known structure = false
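The table-and-field layout on this slide can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name, the column names, and the pairing of entry MEQ147631 with the listed field values are assumptions, since the slide does not say which entry the values belong to.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table for genetic elements, one column per field
# named on the slide. SQLite has no boolean type, so 1/0 stand for true/false.
conn.execute(
    """
    CREATE TABLE genetic_element (
        entry_id        TEXT PRIMARY KEY,
        position        TEXT,
        size_bp         INTEGER,
        coding          INTEGER,
        known_est       INTEGER,
        known_structure INTEGER
    )
    """
)

# Assumed pairing of an entry identifier with the slide's example values.
conn.execute(
    "INSERT INTO genetic_element VALUES (?, ?, ?, ?, ?, ?)",
    ("MEQ147631", "chr. 4", 3540, 1, 1, 0),
)

# Look up selected fields of one entry by its identifier.
row = conn.execute(
    "SELECT position, size_bp, coding FROM genetic_element WHERE entry_id = ?",
    ("MEQ147631",),
).fetchone()
print(row)
```

Each entry becomes one row, and each field becomes one typed column, which is what lets the DBMS answer questions such as "all coding elements on chr. 4" without scanning free text.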

8 Structure of relational databases
[Diagram: queries and results passing between database files, the database management system, and the interface]
Column headings: Database files, Database Management System, Interface (WEB)
Files: structure of data, stored results, result files
DBMS: DBMS software, SQL language (Structured Query Language)
Interface: terminal input scripts, terminal output, browser input scripts, browser output
Arrows: queries to DBMS, queries to data, results from DBMS

9 Database management systems
A software package designed to store and manage databases
A computerized record-keeping system
Allows operations such as:
–Adding new files
–Inserting data into existing files
–Retrieving data from existing files
–Changing data
–Deleting data
–Removing existing files from the database
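The six operations listed above map roughly onto six SQL statements. The sketch below assumes that a "file" on the slide corresponds to a table in a relational DBMS, which is an interpretation rather than something the slide states; the table name and the accessions SEQ0001 and SEQ0002 are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Adding a new file (here: a table).
cur.execute("CREATE TABLE sequence (accession TEXT PRIMARY KEY, length_bp INTEGER)")

# Inserting data into an existing file.
cur.executemany(
    "INSERT INTO sequence VALUES (?, ?)",
    [("SEQ0001", 1200), ("SEQ0002", 860)],
)

# Retrieving data.
long_ones = cur.execute(
    "SELECT accession FROM sequence WHERE length_bp > 1000"
).fetchall()

# Changing data.
cur.execute("UPDATE sequence SET length_bp = 900 WHERE accession = 'SEQ0002'")
updated = cur.execute(
    "SELECT length_bp FROM sequence WHERE accession = 'SEQ0002'"
).fetchone()[0]

# Deleting data.
cur.execute("DELETE FROM sequence WHERE accession = 'SEQ0001'")
remaining = cur.execute("SELECT COUNT(*) FROM sequence").fetchone()[0]

# Removing an existing file from the database.
cur.execute("DROP TABLE sequence")
tables = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(long_ones, updated, remaining, tables)
```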

10 Accessing a database
WEB: graphical user interface (GUI)
WEB: automated procedures
–Batch search with script (Entrez)
–Search robots with e-mail updates
Local
–Buy a big computer and a thick cable
–Speed improvement
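A batch search by script could look something like the sketch below, which only builds the query URLs and does not contact any server. It assumes NCBI's E-utilities `esearch.fcgi` endpoint with its `db`, `term`, and `retmax` parameters; the slide itself names no endpoint, and the helper name and the two search terms are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not named on the slide).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL for one query; hypothetical helper."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}?{urlencode(params)}"


# A batch is just a list of search terms turned into one URL each.
queries = ["HIV-1 envelope", "Drosophila melanogaster[Organism]"]
urls = [build_esearch_url("nucleotide", q) for q in queries]

for url in urls:
    print(url)
```

Each URL could then be fetched with `urllib.request.urlopen`, with a pause between requests so the script stays within the provider's usage limits.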

11 Protein sequence databases
Database            URL
Protein sequence (primary)
SWISS-PROT          www.expasy.ch/sprot/sprot-top.html
PIR-International   www.mips.biochem.mpg.de/proj/protseqdb
Protein sequence (composite)
OWL                 www.bioinf.man.ac.uk/dbbrowser/OWL
NRDB                www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Protein sequence (secondary)
PROSITE             www.expasy.ch/prosite
PRINTS              www.bioinf.man.ac.uk/dbbrowser/PRINTS/PRINTS.html
Pfam                www.sanger.ac.uk/Pfam/

12 Nucleotide sequence databases
GenBank             www.ncbi.nlm.nih.gov/Genbank
EMBL                www.ebi.ac.uk/embl
DDBJ                www.ddbj.nig.ac.jp

13 Types of nucleotide data
cDNA
–Reverse transcribed from mRNA
Genomic sequences
–Sequenced directly from the DNA of various species
ESTs
–A short fragment of a gene's sequence, derived from mRNA

14 Macromolecular structure databases
Protein Data Bank (PDB)        www.rcsb.org/pdb
Nucleic Acids Database (NDB)   http://ndbserver.rutgers.edu/
PDBsum                         www.biochem.ucl.ac.uk/bsm/pdbsum
CATH                           www.biochem.ucl.ac.uk/bsm/cath
SCOP                           http://scop.mrc-lmb.cam.ac.uk/scop/
FSSP                           www.embl-ebi.ac.uk/dali/fssp

15 Molecular interaction databases
General
–Biomolecular Interaction Network Database (BIND)   http://bioinfo.mshri.on.ca/cgi-bin/bind/dataman
–Molecular Interactions Database (MINT)             http://cbm.bio.uniroma2.it/mint/
Protein-protein interactions
–Database of Interacting Proteins (DIP)             http://dip.doe-mbi.ucla.edu/
Biochemical pathways
–KEGG Metabolic Pathways                            http://www.genome.ad.jp/kegg/metabolism.html

16 Proteomics databases
Yeast Proteome Database   http://www.incyte.com/sequence/proteome/databases/YPD.shtml
SWISS-2DPAGE              http://us.expasy.org/ch2d/
TMIG-2DPAGE               http://proteome.tmig.or.jp/2D/

17 Genome databases
Entrez genomes          www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
Ensembl genomes         http://www.ensembl.org/
HIV Sequence Database   http://hiv-web.lanl.gov/content/hiv-db/mainpage.html
FlyBase                 http://flybase.bio.indiana.edu/
COGs                    www.ncbi.nlm.nih.gov/COG

18 Integrated databases
Increasing the value of information
InterPro                          www.ebi.ac.uk/interpro
Sequence Retrieval System (SRS)   www.expasy.ch/srs5
Entrez                            www.ncbi.nlm.nih.gov/Entrez

19 Entrez
[Diagram: The (ever) Expanding Entrez System]
Linked databases: Journals, UniGene, PubMed, Nucleotide, Protein, SNP, Genome, Books, ProbeSet, OMIM, CDD, Taxonomy, 3D Domains, UniSTS, PopSet, Structure

20 EBI services
http://www.ebi.ac.uk/services/index.html

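21 The International Sequence Database Collaboration
[Diagram: three databases exchanging submissions and updates]
GenBank: hosted at NCBI (NIH), accessed through Entrez
EMBL: hosted at EBI (EMBL), accessed through SRS
DDBJ: hosted at CIB (NIG), accessed through getentry
Each site receives submissions and shares updates with the other two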

22 A closer look at GenBank
Maintained by NCBI
Accessed through Entrez
Synchronized with DDBJ and EMBL

23 Sequence file formats
Ideally: a stringent, easy-to-parse, well-specified format to facilitate the dissemination of information
Reality: a plethora of haphazard and badly specified formats
Different levels of information
–Simple: sequence and name attribute
–Advanced: several attributes
Some common formats
–FASTA
–GenBank
–PHYLIP (PHYLIP package and others)
–Nexus (PAUP package, MacClade and others)
–Up and coming: XML

24 FASTA format
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTT
GLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKT
VLPVTIMAGLVFHSQKYNLRLRQAWCHFPSNWKGAWKEVKEEIVNLP
KERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCKMDWFLNYL
NNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLE
TISKKTYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESI
WAAELDRYKLVEITPIGFAPTEVRRYTGGHERQKRVPFVXXXXXXXX
XXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
http://www.ncbi.nlm.nih.gov/BLAST/fasta.html
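The FASTA layout is simple enough to parse in a few lines: a header line starting with ">" followed by one or more sequence lines. The sketch below is one minimal way to do it, not a reference implementation; the function name `parse_fasta` is hypothetical, and the example uses only the header and the first two sequence lines of the record on this slide.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTT
GLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKT
"""

records = parse_fasta(example)
header, sequence = records[0]
# The header is itself pipe-delimited; the second field is the gi number.
gi_number = header.split("|")[1]
print(gi_number, len(sequence))
```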

25 GenBank format
A verbose but very informative format
Contains much information in a carefully specified format
Harder to parse than FASTA
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
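To give a feel for why GenBank records are harder to parse than FASTA, the sketch below reads a few top-level fields and the sequence from a flat-file record. It relies on the layout in which keywords occupy the first 12 columns, continuation lines are indented, the sequence follows the ORIGIN line, and "//" ends the record. The record shown is a made-up toy (accession TOY0001 is not a real entry), and the parser deliberately ignores the structure of the feature table.

```python
def parse_genbank(text):
    """Return a dict of top-level fields plus the sequence; hypothetical helper."""
    fields = {}
    sequence = []
    in_origin = False
    current = None
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of record
        if in_origin:
            # Sequence lines carry a position number and blocks of 10 bases.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:12].strip():
            # A keyword sits in the first 12 columns; its value follows.
            current = line[:12].strip()
            fields[current] = line[12:].strip()
        elif current is not None and line.strip():
            # An indented line continues the previous field.
            fields[current] += " " + line.strip()
    fields["sequence"] = "".join(sequence)
    return fields


# Toy record for illustration only; not taken from GenBank.
toy_record = """LOCUS       TOY0001                   24 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical toy record,
            for illustration only.
ACCESSION   TOY0001
ORIGIN
        1 atggcgtacg ttagcctagg ctaa
//
"""

record = parse_genbank(toy_record)
print(record["ACCESSION"], record["DEFINITION"], len(record["sequence"]))
```

Real records also carry a nested feature table with qualifiers, so an established parser is usually a better choice than hand-written code like this.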

26 eXtensible Markup Language (XML)
Markup language for data representation, derived from SGML, sibling of HTML
Stringent, simple language with rigid rules
Human-readable and versatile
Good parsers exist for multiple platforms
The ability to design your own Document Type Definitions (DTDs), which parsers can use to validate a document, permits complex data structures and grammars
Examples of use for sequence data:
–NCBI GBSeqXML
–NCBI TinySeqXML
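Because XML is strictly structured, a standard-library parser can pull named attributes out of a record without any custom text handling. The sketch below uses Python's `xml.etree.ElementTree` on a small document whose element names are modeled on NCBI's TinySeq XML; those names and the record content are assumptions for illustration, not a verified copy of the schema. Note that ElementTree checks well-formedness only and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Toy document; element names assumed to follow the TinySeq style.
xml_text = """<TSeqSet>
  <TSeq>
    <TSeq_accver>TOY0001.1</TSeq_accver>
    <TSeq_defline>Hypothetical toy record</TSeq_defline>
    <TSeq_length>24</TSeq_length>
    <TSeq_sequence>ATGGCGTACGTTAGCCTAGGCTAA</TSeq_sequence>
  </TSeq>
</TSeqSet>"""

root = ET.fromstring(xml_text)

# One dictionary per <TSeq> element, with fields looked up by tag name.
records = [
    {
        "accession": tseq.findtext("TSeq_accver"),
        "defline": tseq.findtext("TSeq_defline"),
        "sequence": tseq.findtext("TSeq_sequence"),
        "declared_length": int(tseq.findtext("TSeq_length")),
    }
    for tseq in root.iter("TSeq")
]

for rec in records:
    # A cheap consistency check that a rigid format makes possible.
    print(rec["accession"], rec["declared_length"] == len(rec["sequence"]))
```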

27 Links
http://www.ncbi.nlm.nih.gov/Education/
http://www.infobiogen.fr/services/dbcat/
http://www.science.gmu.edu/~ntongvic/Bioinformatics/database.html
http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-interaction.html
http://www.no.embnet.org/Programs/DB/srs.php3

