Databases (“knowledge bases”) used in genome analysis

1 Databases (“knowledge bases”) used in genome analysis
Michael Y. Galperin, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA

2 Growth in genome sequencing

3 Working Draft Sequence

4 J. Smith - a very common name
Structure - a very common term
Glutamine amidotransferase - a less common term, but not a very good descriptor

5 A different professor Janet Smith
Another Janet Smith in the news

6 Glutamine for sale

7 Tools of trade for the “armchair scientist”
Databases: PubMed and other NCBI databases; biochemical databases; protein domain databases; structural databases; genome comparison databases
Tools: CDD / COGs; VAST / FSSP

8 Types of databases
Archival or Primary Data
Text: PubMed
DNA Sequence: GenBank
Protein Sequence: Entrez Proteins, TREMBL
Protein Structures: PDB
Curated or Processed Data
DNA Sequences: RefSeq, LocusLink, OMIM
Protein Sequences: SWISS-PROT, PIR
Protein Structures: SCOP, CATH, MMDB
Genomes: Entrez Genomes, COGs
Nucleic Acids Research: Database Issue each January 1, with articles on ~100 different databases


10 The National Center for Biotechnology Information (NCBI)
Created as a part of the National Library of Medicine, National Institutes of Health, in 1988
Establish public databases
Research in computational biology
Develop software tools for sequence analysis
Disseminate biomedical information
Tools: BLAST (1990), Entrez (1992)
GenBank (1992)
Free MEDLINE (PubMed, 1997)
Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, Taxonomy, GeneMap, SAGE, LocusLink, RefSeq

11 What is GenBank?
Archival nucleotide sequence database
Sample slogans: “Easy deposits, unlimited withdrawals, high interest”, “All bases covered”, “Billions and billions served”
Data are shared nightly among three collaborating databases:
GenBank at NCBI - Bethesda, Maryland, USA
DNA Database of Japan (DDBJ) at NIG - Mishima, Japan
European Molecular Biology Laboratory Database (EMBL) at EBI - Hinxton, UK

12 Some guiding principles of working with GenBank
GenBank is a nucleotide-centric view of the information space
GenBank is a repository of all publicly available sequences
In GenBank, records are grouped for various reasons
Data in GenBank is only as good as what you put in

13 NCBI databases and their links
[Diagram labels. Databases: Article Abstracts (Medline), Nucleotide Sequences, Protein Sequences, 3-D Structure (MMDB), Genomes, Taxonomy. Link types: Word Weight, BLAST, VAST, Phylogeny]

14 Entrez: An integrated search and retrieval system






20 PubMed book links

21 GenBank Record
Labeled fields: Locus Name, Accession Number, gi Number, Medline ID, GenPept ID, Protein Sequence, Nucleotide Sequence
[rest of protein sequence deleted for brevity] [rest of nucleotide sequence deleted for brevity]



24 Archival databases are unreliable
Misinterpreted experimental results
Annotations based on low similarity:
gi| cDNA 5' end similar to similar to arrest-defective protein isolog (H. sapiens)
gi| very hypothetical protein (S. pombe)
Biologically senseless annotations:
Deinococcus: head morphogenesis protein
Arabidopsis: separation anxiety protein-like
Yersinia: automembrane protein H
H. pylori: brute force protein
S. cerevisiae: inside intron 7
Propagated mistakes of sequence comparison (e.g. ABC1/ABC)

25 Advanced Neighbors: BLink

26 BLink












38 Protein sequence motif is a descriptor of a protein family
Glutamine amidotransferase class I: [PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA] [C is the active site residue]
Glutamine amidotransferase class II: <x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]
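
A pattern written in this notation can be tested against a sequence by translating it into a regular expression. The following is a minimal Python sketch under stated assumptions, not an official PROSITE tool: the function name `prosite_to_regex` and the two test peptides are made up for illustration, and only the notation used in the two patterns above (plus `{...}` exclusions and the `>` anchor) is handled.

```python
import re

# The two motifs quoted on the slide.
GATASE_CLASS_I = "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
GATASE_CLASS_II = "<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]"


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        prefix = suffix = ""
        if element.startswith("<"):   # '<' anchors the match at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # '>' anchors the match at the C-terminus
            suffix, element = "$", element[:-1]
        # Optional repeat count, e.g. x(0,11) or A(3)
        repeat = ""
        counted = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if counted:
            element, repeat = counted.group(1), "{" + counted.group(2) + "}"
        if element == "x":            # x stands for any residue
            core = "."
        elif element.startswith("{"):  # {ABC} means any residue except A, B, C
            core = "[^" + element[1:-1] + "]"
        else:                         # a single residue or an [ABC] class
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Hypothetical toy peptides, invented only to exercise the patterns.
toy_class_i = "MKALLGICLGMQALT"
toy_class_ii = "MCGIVG"

print(bool(re.search(prosite_to_regex(GATASE_CLASS_I), toy_class_i)))
print(bool(re.search(prosite_to_regex(GATASE_CLASS_II), toy_class_ii)))
```

The class II pattern shows why the `<` anchor matters: the cysteine must lie within the first 12 residues, so the same five-residue motif deeper in a sequence does not count as a match.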









47 purF gene neighbors

48 Searching MMDB

49 Principles of structural alignment
Dali: looks for minimal RMSD between Cα atoms. Calculates Cα-Cα distance matrices, then identifies the longest alignable segments
VAST (Vector Alignment Search Tool): looks for pairs of secondary structure elements (α-helices, β-strands) that have similar orientation and connectivity
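
The distance-matrix idea can be illustrated in a few lines. This is only a sketch of the underlying representation, not Dali's actual scoring or search procedure: the function names and the toy coordinates are hypothetical, and the comparison used here (mean absolute difference of two equally sized matrices) is a deliberate simplification.

```python
import math


def ca_distance_matrix(ca_coords):
    """Return the symmetric matrix of pairwise distances between Cα positions."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def mean_abs_difference(matrix_a, matrix_b):
    """Average absolute difference between two equally sized distance matrices."""
    n = len(matrix_a)
    if n == 0 or n != len(matrix_b):
        raise ValueError("matrices must be non-empty and of the same size")
    total = sum(
        abs(matrix_a[i][j] - matrix_b[i][j]) for i in range(n) for j in range(n)
    )
    return total / (n * n)


# Toy coordinates (hypothetical, not taken from any PDB entry).
fragment = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 5.0)]
# The same fragment moved rigidly to a different place in space.
shifted = [(x + 1.0, y + 2.0, z + 3.0) for x, y, z in fragment]

d_fragment = ca_distance_matrix(fragment)
d_shifted = ca_distance_matrix(shifted)
print(mean_abs_difference(d_fragment, d_shifted))
```

Because internal distances do not change when a structure is translated or rotated, two fragments can be compared through their distance matrices without first superimposing them.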

50 Dali alignment of Tyr phosphatase

51 VAST Structure Neighbors

52 Structure Summary
BLAST neighbors, VAST neighbors, Cn3D viewer

53 Cn3D : Displaying Structures

54 Structure Neighbors

55 Use of structural alignments
Chloroquine NADH

56 Online Mendelian Inheritance in Man
A catalog of human genes and genetic disorders

57 OMIM record for Presenilin 1 (PSEN1)
Contents
Additional info in OMIM
Each record provides a state-of-the-art summary of current knowledge
Associated LocusLink record
External resources
Extensive references to literature

58 OMIM Search Results by Titles
alzheimer AND presenilin 1
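
The slide shows this Boolean query typed into the OMIM web form. As a hedged illustration, the same query string could also be sent programmatically. The sketch below assumes the NCBI E-utilities `esearch` endpoint and its `db` and `term` parameters, which are not mentioned in the talk; the helper name `build_esearch_url` is made up, and the code only builds the request URL without contacting the server.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint (not part of the original slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str) -> str:
    """Build an ESearch request URL for a database name and an Entrez query."""
    # urlencode escapes the spaces in the Boolean query for use in a URL.
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})


url = build_esearch_url("omim", "alzheimer AND presenilin 1")
print(url)
```

The Boolean operator is written in upper case, as on the slide, so that it is read as an operator and not as a search word.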

59 Entrez Genome: Gene Location
View of chromosome 14 Multiple Maps STSs, ESTs, etc. Gene Name

60 Integrated View of Chromosome 7
Entrez Genomes Map Viewer: Chromosome 7
Maps shown: GenBank Map, Contig Map, STS Map
Multiple Maps: STSs, ESTs, etc.

61 Entrez Genome: Gene Location
View of chromosome 14 Gene Name

62 Entrez Genome: Gene Location
Entrez Genomes Map Viewer: Chromosome 14, cytogenetic map
Location of PSEN1 and surrounding genes

63 LocusLink

64 LocusLink
Curated resource: central hub of information for human, mouse, rat, zebrafish, and fruit fly loci
Multiple organisms; text querying; alphabetical listings
Example query: alzheimer
Fields shown: approved symbol, stable Locus ID, description, genome position, external links

65 LocusLink
Linked resources: RefSeq, GenBank, OMIM, UniGene, dbSNP

66 LocusLink: LocusID 5663 PSEN1

67 National Center for Biotechnology Information
Directed by Dr. David J. Lipman
