2010-09-28 IST Computational Biology. Slide 1: Information Retrieval, Biological Databases 2. Pedro Fernandes, Instituto Gulbenkian de Ciência, Oeiras, PT.
Slide 2: Sizing Biological Information. This week (20 Sept. 2010) the EMBL Database contained 298 × 10^9 nucleotides in 195,945,264 entries.
Slide 3: Sizing Biological Information. Release 2010_09 of 10-Aug-10 of UniProtKB/Swiss-Prot contains 519348 sequence entries, comprising 183273162 amino acids abstracted from 191032 references. 998 sequences have been added since release 2010_08, the sequence data of 160 existing entries has been updated and the annotations of 480770 entries have been revised.

Protein existence (PE)          entries       %
Evidence at protein level         70514   13.6%
Evidence at transcript level      67192   12.9%
Inferred from homology           365712   70.4%
Predicted                         14317    2.8%
Uncertain                          1613    0.3%
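The percentages in the protein-existence breakdown follow directly from the entry counts. A minimal sketch that recomputes them, using only the figures quoted on the slide (the names `pe_counts` and `pe_percentages` are illustrative and not part of any UniProt tool):

```python
# Counts per protein-existence (PE) level, copied from the slide.
pe_counts = {
    "Evidence at protein level": 70514,
    "Evidence at transcript level": 67192,
    "Inferred from homology": 365712,
    "Predicted": 14317,
    "Uncertain": 1613,
}


def pe_percentages(counts):
    """Return each category's share of the total, in percent, rounded to 0.1."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# The category counts should add up to the number of entries in the release.
total_entries = sum(pe_counts.values())
shares = pe_percentages(pe_counts)

for name, share in shares.items():
    print(f"{name:30s} {pe_counts[name]:>7d} {share:>5.1f}%")
```

The five counts sum to the 519348 entries reported for the release, which is a quick consistency check on the table.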
Slide 4: Sizing Biological Information
Slide 5: Sizing Biological Information
Slide 6: Protein Structures (RCSB PDB, 1980-2010)

Method                 structures
X-ray                       59072
NMR                          8588
Electron microscopy           306
Hybrid                         26
Other                         147
Total                       68139
Slide 7: Data deluge, where from? Sequencing (NGS, SMS); microarray experiments; parallelized drug screening and testing; other.
Slide 8: Gene Ontology – towards consistent descriptions. The need to produce consistent, effective searches: uniform terminology; controlled vocabulary; hierarchical relations.
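To illustrate why hierarchical relations make searches consistent, here is a small sketch of a controlled vocabulary stored as a graph of "is_a" links. The term names, the `PARENTS` table and both functions are made up for illustration; this is not an extract of the real Gene Ontology or of any GO software.

```python
# Toy controlled vocabulary: each term lists its direct parents ("is_a").
# Term names are illustrative only, not real Gene Ontology content.
PARENTS = {
    "biological process": [],
    "metabolic process": ["biological process"],
    "cellular process": ["biological process"],
    "protein metabolic process": ["metabolic process"],
    "translation": ["protein metabolic process", "cellular process"],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following is_a links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def annotated_with(query, annotations, parents=PARENTS):
    """Return genes annotated with `query` or with any more specific term."""
    return sorted(
        gene
        for gene, term in annotations.items()
        if term == query or query in ancestors(term, parents)
    )


# Hypothetical annotations: one term per gene.
annotations = {"geneA": "translation", "geneB": "cellular process"}
print(annotated_with("metabolic process", annotations))
```

Because a term may have more than one parent, the structure is a directed acyclic graph rather than a tree, so a search for a general term also retrieves genes annotated with any of its more specific descendants.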
Slide 9: Gene Ontology
Slide 10: Specialized search tools. Searching on specific fields is relatively easy. Using keywords allows indexed searching on text fields. Searching sequence data is more complex. Similarity search: BLAST is a fast way of searching sequence data for similarity. Some databases of nucleotide or protein sequences are formatted for BLAST.
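One reason sequence search is harder than keyword search is that matches are approximate. A common first step is to index short words (k-mers) of the database ahead of time and look up the query's words in that index. The sketch below shows only that word-lookup idea. It is not BLAST: there is no scoring matrix, hit extension or statistics, and the function names are hypothetical.

```python
from collections import defaultdict


def index_kmers(database, k):
    """Map every length-k word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((seq_id, i))
    return index


def seed_hits(query, index, k):
    """Count exact k-mer matches between the query and each indexed sequence."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for seq_id, _offset in index.get(query[i:i + k], []):
            hits[seq_id] += 1
    # Most shared words first; ties broken by sequence id.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Toy nucleotide "database"; the index is built once and reused per query.
database = {"s1": "ACGTACGT", "s2": "TTTTGGGG"}
kmer_index = index_kmers(database, k=4)
print(seed_hits("ACGTAC", kmer_index, k=4))
```

Building the index once and reusing it for every query is loosely analogous to preparing a database for similarity searching in advance.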
Slide 11: Interoperability. Adherence to standards; minimal experiment descriptions; ontological concerns; integration; warehousing.
Slide 12: Bibliography DBs. PubMed (Medline); "Entrez" searching; data mining in text; tagged text to avoid loss (Utopia Documents).
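Entrez searches can also be issued programmatically. The slide does not say which interface was demonstrated; the sketch below assumes NCBI's public E-utilities `esearch` endpoint and only builds the request URL, without sending it. The helper name `pubmed_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: the NCBI E-utilities ESearch service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, retmax=20):
    """Build an ESearch URL that queries PubMed for `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    # urlencode percent-escapes brackets and turns spaces into '+'.
    return EUTILS_BASE + "?" + urlencode(params)


# A field-restricted query, using PubMed's [MeSH Terms] search tag.
print(pubmed_search_url("hemoglobin[MeSH Terms]", retmax=5))
```

Fetching the URL returns a list of matching PubMed identifiers. Real use should follow NCBI's usage guidelines, which are outside the scope of this sketch.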
Slide 13: Medical Subject Headings. Part of the NLM/PubMed effort. MeSH is a searchable database. Controlled vocabulary; disambiguation; term relationships. Spelling: Hemoglobin or Haemoglobin? Context: NMR spectroscopy or imaging?
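The two examples on the slide (spelling variants and context-dependent abbreviations) can be pictured as a lookup from free text to preferred headings. The sketch below is a toy model of that idea. The `ENTRY_TERMS` table is hand-written for illustration and is not loaded from MeSH, and `normalize` is a hypothetical helper.

```python
# Illustrative mapping from free-text variants to preferred headings.
# Hand-written for this example; not an extract of the MeSH database.
ENTRY_TERMS = {
    "hemoglobin": ["Hemoglobins"],
    "haemoglobin": ["Hemoglobins"],
    "nmr": ["Magnetic Resonance Imaging", "Magnetic Resonance Spectroscopy"],
}


def normalize(term, table=ENTRY_TERMS):
    """Return the sorted list of candidate headings for a free-text term."""
    key = " ".join(term.lower().split())  # fold case and collapse whitespace
    return sorted(table.get(key, []))


print(normalize("Haemoglobin"))  # both spellings resolve to one heading
print(normalize("NMR"))          # ambiguous: context must pick one candidate
```

A single candidate means the spelling variant has been resolved; several candidates mean the term is ambiguous and the searcher still has to choose by context.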
Slide 14: More on bibliography. Web of Knowledge; b-on; institutional repositories; PubCrawler (alerts): http://www.pubcrawler.ie
Slide 15: Structural Protein DBs. Primary: coordinates from X-ray diffraction, NMR, etc.; composition from UniProtKB; properties from annotations.
Slide 16: Specialized DBs. Binding sites; SNPs.
Slide 17: Classification of Proteins. CATH (Class, Architecture, Topology, Homologous superfamily): http://www.biochem.ucl.ac.uk/bsm/cath_new/ SCOP (Structural Classification of Proteins): http://scop.mrc-lmb.cam.ac.uk/scop/
Slide 18: Integrated DBs. Built to aggregate other databases; provide common search; calculate cross-linking tables. InterPro (http://www.ebi.ac.uk/interpro): results from integrating several derivative databases such as PRINTS, PROSITE, SMART, ProDom, Pfam and TIGRFAMs.
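A cross-linking table can be thought of as a pivot: per-database hit lists are regrouped by protein, so that one lookup returns every member database's signatures. The sketch below assumes a simple nested-dictionary input. The accessions, signature identifiers and the function `build_cross_links` are all invented for illustration and do not reflect how InterPro is actually built.

```python
from collections import defaultdict


def build_cross_links(member_hits):
    """Regroup {database: {accession: [signature ids]}} by protein accession."""
    table = defaultdict(dict)
    for db_name, hits in member_hits.items():
        for accession, signatures in hits.items():
            table[accession][db_name] = sorted(signatures)
    return dict(table)


# Hypothetical hits from two member databases (all identifiers are made up).
member_hits = {
    "dbA": {"PROT1": ["A002", "A001"], "PROT2": ["A001"]},
    "dbB": {"PROT1": ["B010"]},
}

cross_links = build_cross_links(member_hits)
print(cross_links["PROT1"])
```

Each protein then has a single row listing the signatures every member database assigns to it, which is what makes a common search across the sources possible.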
Slide 20: GeneCards
Slide 21: GeneCards
Slide 22: GeneCards – expression data
Slide 23: Clinical. OMIM: Mendelian inheritance, human diseases. HGMD: mutations and associated human diseases. dbSNP: SNPs with >1% incidence.
Slide 24: The synchronization issue. Many copies of public databases exist (version control). Content updates on primary and derived databases influence integration. Inconsistencies are slow to resolve. Indexes need frequent recalculation.
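One simple way to detect that a local copy has drifted from the primary database is to compare checksums against a manifest published with each release. The sketch below assumes such a manifest exists (file name mapped to SHA-256 digest); that layout, the file names and the function `stale_copies` are assumptions for illustration, not a description of how any particular database distributes its releases.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest of a file's contents, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def stale_copies(primary_manifest, local_files):
    """Return names whose local content is missing or differs from the primary."""
    stale = []
    for name, expected in primary_manifest.items():
        data = local_files.get(name)
        if data is None or checksum(data) != expected:
            stale.append(name)
    return sorted(stale)


# Hypothetical release: the primary site has moved on to "v2" of one file.
primary_manifest = {
    "entries.dat": checksum(b"release v2"),
    "taxonomy.dat": checksum(b"unchanged"),
}
local_files = {"entries.dat": b"release v1", "taxonomy.dat": b"unchanged"}

print(stale_copies(primary_manifest, local_files))
```

Files reported as stale need to be refreshed, and any index derived from them recalculated, before the mirror is consistent with the primary again.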
Slide 25: Purifying content. Efforts are in place to enhance the contents of derived databases, for example manual curation of genomic databases in specific sectors such as eukaryotes, human, plants, etc.
Slide 26: HAVANA. Manual annotation by chromosome in the human genome.
Slide 27: ENCODE. Project to review functional parts of the human genome in fine detail.