Michael Y. Galperin National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, Maryland, USA Databases.

1 Michael Y. Galperin National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, Maryland, USA Databases (“knowledge bases”) used in genome analysis

2 Growth in genome sequencing

3 Working Draft Sequence gaps

4 J. Smith - a very common name
Structure - a very common term
Glutamine amidotransferase - a less common term, but not a very good descriptor

5 A different professor Janet Smith; another Janet Smith in the news

6 Glutamine for sale

7 Tools of trade for the “armchair scientist”
Databases
–PubMed and other NCBI databases
–Biochemical databases
–Protein domain databases
–Structural databases
–Genome comparison databases
Tools
–CDD / COGs
–VAST / FSSP

8 Types of databases
Archival or Primary Data
–Text: PubMed
–DNA Sequence: GenBank
–Protein Sequence: Entrez Proteins, TREMBL
–Protein Structures: PDB
Curated or Processed Data
–DNA Sequences: RefSeq, LocusLink, OMIM
–Protein Sequences: SWISS-PROT, PIR
–Protein Structures: SCOP, CATH, MMDB
–Genomes: Entrez Genomes, COGs
Nucleic Acids Research: Database Issue each January 1, with articles on ~100 different databases

10 The National Center for Biotechnology Information (NCBI)
Created in 1988 as a part of the National Library of Medicine, National Institutes of Health
–Establish public databases
–Research in computational biology
–Develop software tools for sequence analysis
–Disseminate biomedical information
Tools: BLAST (1990), Entrez (1992)
GenBank (1992)
Free MEDLINE (PubMed, 1997)
Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, Taxonomy, GeneMap, SAGE, LocusLink, RefSeq

11 What is GenBank?
Archival nucleotide sequence database
Sample slogans: “Easy deposits, unlimited withdrawals, high interest”, “All bases covered”, “Billions and billions served”
Data are shared nightly among three collaborating databases:
–GenBank at NCBI - Bethesda, Maryland, USA
–DNA Database of Japan (DDBJ) at NIG - Mishima, Japan
–European Molecular Biology Laboratory Database (EMBL) at EBI - Hinxton, UK

12 Some guiding principles of working with GenBank
–GenBank is a nucleotide-centric view of the information space
–GenBank is a repository of all publicly available sequences
–In GenBank, records are grouped for various reasons
–Data in GenBank are only as good as what you put in

13 NCBI databases and their links
[Diagram labels. Databases: Nucleotide Sequences, Protein Sequences, 3-D Structure (MMDB), Genomes, Taxonomy, Article Abstracts (Medline). Links: Word Weight, VAST, BLAST, Phylogeny]

14 Entrez: An integrated search and retrieval system

20 PubMed book links

21 GenBank Record
Labeled fields: Locus Name, Accession Number, gi Number, Medline ID, GenPept ID, Protein Sequence, Nucleotide Sequence
[rest of protein sequence deleted for brevity] [rest of nucleotide sequence deleted for brevity]

24 Archival databases are unreliable
Misinterpreted experimental results
Annotations based on low similarity
–gi| cDNA 5' end similar to similar to arrest-defective protein isolog (H. sapiens)
–gi| very hypothetical protein (S. pombe)
Biologically senseless annotations
–Deinococcus: head morphogenesis protein
–Arabidopsis: separation anxiety protein-like
–Yersinia: automembrane protein H
–H. pylori: brute force protein
–S. cerevisiae: inside intron 7
Propagated mistakes of sequence comparison (e.g. ABC1/ABC)

25 Advanced Neighbors: BLink

26 BLink

38 A protein sequence motif is a descriptor of a protein family
Glutamine amidotransferase class I: [PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA] [C is the active site residue]
Glutamine amidotransferase class II
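
The class I pattern above is written in PROSITE-style syntax. As a minimal sketch of how such a motif can be scanned for, the Python snippet below converts the pattern into a regular expression and searches a sequence. The function name `prosite_to_regex` and the toy sequence are made up for illustration and are not part of the slides. Only the common syntax elements are handled (single residues, `[...]` classes, `{...}` exclusions, `x`, and repeat counts), and terminal anchors are not supported.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex string.

    Hypothetical helper for illustration; covers only a subset of the syntax.
    """
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        repeat = ""
        # Optional repeat count, e.g. x(2) or x(2,4)
        counted = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if counted:
            element = counted.group(1)
            upper = "," + counted.group(3) if counted.group(3) else ""
            repeat = "{" + counted.group(2) + upper + "}"
        if element == "x":
            core = "."                           # any residue
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"    # any residue except these
        else:
            core = element                       # single letter or [ABC] class
        parts.append(core + repeat)
    return "".join(parts)


# Pattern as given on the slide for glutamine amidotransferase class I
GATASE_CLASS_I = (
    "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
)
motif = re.compile(prosite_to_regex(GATASE_CLASS_I))

# Toy sequence invented for this example, not a real protein
toy_sequence = "MKKALVGLCLGMQHLATT"
hit = motif.search(toy_sequence)
if hit:
    # The conserved C is the sixth position of the motif
    print(hit.start(), hit.group(), "active-site C at index", hit.start() + 5)
```

A match only says that the sequence fits the descriptor. Short patterns like this one can also hit unrelated proteins by chance, so a hit is a starting point for closer inspection rather than a family assignment.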

47 purF gene neighbors

48 Searching MMDB

49 Principles of structural alignment
Dali looks for the minimal RMSD between Cα atoms: it calculates Cα-Cα distance matrices, then identifies the longest alignable segments
VAST (Vector Alignment Search Tool) looks for pairs of secondary structure elements (α-helices, β-strands) that have similar orientation and connectivity
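
To make the distance-matrix idea concrete, here is a small Python sketch. It is not Dali's actual algorithm or scoring function. It only shows why intramolecular Cα-Cα distance matrices are convenient: they do not change when a structure is moved or rotated, so two fragments can be compared without superimposing them first. The function names and the toy coordinates are assumptions made for this example.

```python
import math


def distance_matrix(coords):
    """All-against-all distances between C-alpha coordinates (x, y, z)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def distance_matrix_rmsd(coords_a, coords_b):
    """Root-mean-square difference between two distance matrices.

    Assumes the residues are already paired one to one (same length, same
    order); finding that pairing is the hard part that real tools solve.
    """
    n = len(coords_a)
    if n != len(coords_b) or n < 2:
        raise ValueError("need two equally long fragments of at least 2 residues")
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    squared = [
        (da[i][j] - db[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)  # each residue pair once, diagonal skipped
    ]
    return math.sqrt(sum(squared) / len(squared))


# Toy fragments (invented coordinates in angstroms)
fragment = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(x + 1.0, y + 2.0, z + 3.0) for x, y, z in fragment]  # same shape, moved
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]       # different shape

print(distance_matrix_rmsd(fragment, shifted))  # about 0: translation does not matter
print(distance_matrix_rmsd(fragment, bent))     # above 0: the geometry differs
```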

50 Dali alignment of Tyr phosphatase

51 VAST Structure Neighbors

52 Structure Summary: Cn3D viewer, VAST neighbors, BLAST neighbors

53 Cn3D: Displaying Structures (Chloroquine)

54 Structure Neighbors

55 Use of structural alignments (Chloroquine, NADH)

56 Online Mendelian Inheritance in Man (OMIM): a catalog of human genes and genetic disorders

57 OMIM record for Presenilin 1 (PSEN1)
Labeled parts: Contents, Associated LocusLink record, External resources, Additional info in OMIM
Each record provides a state-of-the-art summary of current knowledge
Extensive references to literature

58 OMIM Search Results by Titles; query: alzheimer AND presenilin 1
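
The same Boolean query can also be issued programmatically. The sketch below assumes NCBI's E-utilities `esearch` endpoint, which the slides do not mention, and only builds the request URL without sending it. The helper name `build_esearch_url` and the `retmax` default are illustrative choices.

```python
from urllib.parse import urlencode

# E-utilities search endpoint (an assumption of this example, not from the slides)
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an esearch URL for an Entrez database and a Boolean query."""
    # urlencode escapes the query; spaces become "+"
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


url = build_esearch_url("omim", "alzheimer AND presenilin 1")
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return a list of matching record IDs. That step is left out here so the example runs without network access.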

59 Entrez Genome: Gene Location. View of chromosome 14; Gene Name; Multiple Maps; STSs, ESTs, etc.

60 Entrez Genomes Map Viewer Chromosome 7 GenBank Map Contig Map STS Map Integrated View of Chromosome 7 Multiple Maps STSs, ESTs, etc.

61 Entrez Genome: Gene Location. View of chromosome 14; Gene Name

62 Entrez Genome: Gene Location Entrez Genomes Map Viewer Chromosome 14 Cytogenetic map Location of PSEN1 and surrounding genes

63 LocusLink

64 LocusLink
Central hub of information for human, mouse, rat, zebrafish, and fruit fly loci
Curated Resource
Text querying, Multiple Organisms, Alphabetical listings, Stable Locus ID, Approved symbol, Description, Genome Position, External Links
Example query: alzheimer

65 OMIM RefSeq GenBank UniGene dbSNP LocusLink

66 LocusLink: LocusID 5663 PSEN1

67 Directed by Dr. David J. Lipman National Center for Biotechnology Information

