5 PubMed is… the National Library of Medicine's search service; 16 million citations in MEDLINE; links to participating online journals; PubMed tutorial (via “Education” on side bar). Page 24
6 Entrez integrates… the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes. Page 24
7 Entrez is a search and retrieval system that integrates NCBI databases Page 24
8 BLAST is… the Basic Local Alignment Search Tool; NCBI's sequence similarity search tool; supports analysis of DNA and protein databases; 100,000 searches per day. Page 25
9 OMIM is… Online Mendelian Inheritance in Man; a catalog of human genes and genetic disorders; edited by Dr. Victor McKusick and others at JHU. Page 25
10 Cancer Chromosomes: Contains cytogenetic, clinical, and reference information integrated from the NCI Mitelman Database of Chromosome Aberrations in Cancer, the NCI Recurrent Aberrations in Cancer database, and the NCI/NCBI SKY/M-FISH & CGH Database.
11 CDD: Conserved Domain Database, a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. Select 'Domains' from the Entrez pull-down menu.
12 CoreNucleotide: Contains all nucleotide sequences not included in the EST or GSS subsets. 3D Domains: Contains protein domains from the Entrez Structure database. EST: A Nucleotide database subset that contains only Expressed Sequence Tag records. Gene: Genes and associated information for a number of organisms, including human.
13 Genome: Genomes of over 1,200 organisms can be found in this database, representing both completely sequenced organisms and those for which sequencing is in progress. Genome Project: A searchable collection of complete and incomplete (in-progress) large-scale sequencing, assembly, annotation, and mapping projects for cellular organisms. dbGaP: Associated genotype and phenotype data. GENSAT: Gene expression atlas of the mouse central nervous system.
14 GEO Datasets: Curated gene expression and molecular abundance DataSets from NCBI's Gene Expression Omnibus, a gene expression and hybridization array repository. GEO Profiles: Individual gene expression and molecular abundance profiles assembled from the GEO repository.
15 Books is… a searchable resource of on-line books. Page 26
16 TaxBrowser is… a browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses); taxonomy information such as genetic codes; molecular data on extinct organisms. Page 26
17 Structure site includes… the Molecular Modelling Database (MMDB); biopolymer structures obtained from the Protein Data Bank (PDB); Cn3D (a 3D-structure viewer); the vector alignment search tool (VAST). Page 26
18 Accessing information on molecular sequences Page 26
19 Accession numbers are labels for sequences: NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26
20 What is an accession number? An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4), covering DNA, RNA, and protein records:
X: GenBank genomic DNA sequence
NT_: Genomic contig
Rs: dbSNP (single nucleotide polymorphism)
N: An expressed sequence tag (1 of 170)
NM_: RefSeq DNA sequence (from a transcript)
NP_: RefSeq protein
AAC02945: GenBank protein
Q: SwissProt protein
1KT7: Protein Data Bank structure record
Page 27
21 Four ways to access DNA and protein sequences: (1) Entrez Gene with RefSeq; (2) UniGene; (3) European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI); (4) ExPASy Sequence Retrieval System (separate from NCBI). Note: LocusLink at NCBI was recently retired. The third printing of the book has updated these sections (pages 27-31). Page 27
22 4 ways to access protein and DNA sequences: Entrez Gene with RefSeq. Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735). Page 27
23 From the NCBI home page, type “rbp4” and hit “Go” (revised Fig. 2.7)
36 NCBI’s important RefSeq project: best representative sequences. RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence. RefSeq identifiers include the following formats:
Complete genome: NC_######
Complete chromosome: NC_######
Genomic contig: NT_######
mRNA (DNA format): NM_###### (e.g. NM_006744)
Protein: NP_###### (e.g. NP_006735)
Page 29-30
37 NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences
Accession  Molecule  Method           Note
AC_        Genomic   Mixed            Alternate complete genomic
AP_        Protein   Mixed            Protein products; alternate
NC_        Genomic   Mixed            Complete genomic molecules
NG_        Genomic   Mixed            Incomplete genomic regions
NM_        mRNA      Mixed            Transcript products; mRNA
NM_        mRNA      Mixed            Transcript products; 9-digit
NP_        Protein   Mixed            Protein products
NP_        Protein   Curation         Protein products; 9-digit
NR_        RNA       Mixed            Non-coding transcripts
NT_        Genomic   Automated        Genomic assemblies
NW_        Genomic   Automated        Genomic assemblies
NZ_ABCD    Genomic   Automated        Whole genome shotgun data
XM_        mRNA      Automated        Transcript products
XP_        Protein   Automated        Protein products
XR_        RNA       Automated        Transcript products
YP_        Protein   Auto. & Curated  Protein products
ZP_        Protein   Automated        Protein products
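As a rough illustration of how the prefixes in the table above could be read programmatically, the following Python sketch maps a RefSeq-style prefix to its molecule type. The helper name `refseq_molecule` and the dictionary are hypothetical and introduced only for this example; the mapping itself is transcribed from the table. Any identifier that lacks a two-letter prefix followed by an underscore (for example the GenBank protein AAC02945) is reported as not RefSeq.

```python
import re

# Prefix -> molecule type, transcribed from the RefSeq table above.
# (Hypothetical helper data; not an official NCBI resource.)
REFSEQ_MOLECULE = {
    "AC_": "Genomic", "NC_": "Genomic", "NG_": "Genomic",
    "NT_": "Genomic", "NW_": "Genomic", "NZ_": "Genomic",
    "NM_": "mRNA", "XM_": "mRNA",
    "NR_": "RNA", "XR_": "RNA",
    "AP_": "Protein", "NP_": "Protein", "XP_": "Protein",
    "YP_": "Protein", "ZP_": "Protein",
}

def refseq_molecule(accession):
    """Return the molecule type for a RefSeq-style accession, else None."""
    match = re.match(r"^([A-Z]{2}_)", accession.strip().upper())
    if match is None:
        return None  # no two-letter-plus-underscore prefix
    return REFSEQ_MOLECULE.get(match.group(1))

if __name__ == "__main__":
    for acc in ["NM_006744", "NP_006735", "AAC02945", "1KT7"]:
        print(acc, "->", refseq_molecule(acc))
```

A lookup of this kind only inspects the prefix; it does not check that the accession actually exists in any database.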
38 Four ways to access DNA and protein sequences: (1) Entrez Gene with RefSeq; (2) UniGene; (3) European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI); (4) ExPASy Sequence Retrieval System (separate from NCBI). Page 31
40 UniGene: unique genes via ESTs
• Find UniGene at NCBI. UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library.
• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.
Pages 20-21
41 Cluster sizes in UniGene: This is a gene with 1 EST associated; the cluster size is 1. Fig. 2.3, Page 23
42 Cluster sizes in UniGene: This is a gene with 10 ESTs associated; the cluster size is 10.
43 Cluster sizes in UniGene (human), UniGene build 194, 8/06
Cluster size (ESTs)  Number of clusters
1                    42,800
2                    6,500
…
16,000-30,000        8
44 UniGene: unique genes via ESTs. Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver). We will discuss UniGene further later (gene expression). Page 31
45 Four ways to access DNA and protein sequences: (1) Entrez Gene with RefSeq; (2) UniGene; (3) European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI); (4) ExPASy Sequence Retrieval System (separate from NCBI). Page 31
46 Ensembl to access protein and DNA sequences: Try Ensembl, a premier human genome web browser. We will encounter Ensembl as we study the human genome, BLAST, and other topics.
50 Four ways to access DNA and protein sequences: (1) Entrez Gene with RefSeq; (2) UniGene; (3) European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI); (4) ExPASy Sequence Retrieval System (separate from NCBI). Page 33
51 ExPASy to access protein and DNA sequences: the ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System). Page 33
56 Searching for HIV-1 pol: following the “genome” link yields a manageable three results. Page 34
57 Example of how to access sequence data: HIV-1 pol. For the Entrez query: hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for “hiv-1”), but these can easily be reduced in two steps: (1) specify the organism, e.g. hiv-1[organism]; (2) limit the output to RefSeq! Page 34
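The two narrowing steps above are described for the Entrez web page. As a hedged sketch, the same query could be assembled as a search string and a URL for NCBI's E-utilities ESearch service. Several things here are assumptions that are not in the slides: the helper names `build_term` and `build_esearch_url`, the use of the `srcdb_refseq[prop]` filter as one way to limit results to RefSeq, and the choice of the `nucleotide` database. The code only builds the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# E-utilities ESearch endpoint (assumed here; the slides use the web form).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_term(term, organism=None, refseq_only=False):
    """Combine a free-text term with the two narrowing steps from the slide."""
    parts = [term]
    if organism:
        parts.append(f"{organism}[organism]")  # step 1: specify the organism
    if refseq_only:
        parts.append("srcdb_refseq[prop]")     # step 2: limit to RefSeq (assumed filter)
    return " AND ".join(parts)

def build_esearch_url(term, organism=None, refseq_only=False, db="nucleotide"):
    """Return an ESearch URL for the combined query; no request is sent."""
    query = build_term(term, organism, refseq_only)
    return ESEARCH + "?" + urlencode({"db": db, "term": query})

if __name__ == "__main__":
    print(build_term("pol", organism="hiv-1", refseq_only=True))
    print(build_esearch_url("pol", organism="hiv-1", refseq_only=True))
```

Keeping the term builder separate from the URL builder makes it easy to paste the plain query string into the Entrez search box instead.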
58 Over 100,000 nucleotide entries for HIV-1; only 1 RefSeq
59 Examples of how to access sequence data: histone. Query for “histone”, successive refinements (# results): protein records; RefSeq entries; RefSeq (limit to human): 1108; NOT deacetylase: 697. At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.
63 PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has >14 million records dating back to 1966. Page 35
64 MeSH is the acronym for "Medical Subject Headings." MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM. MeSH vocabulary is used for indexing journal articles for MEDLINE. The MeSH controlled vocabulary imposes uniformity and consistency on the indexing of biomedical literature. Page 35
67 PubMed search strategies: Try the tutorial (“Education” on the left sidebar). Use boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease. Try using “limits”. Try “Links” to find Entrez information and external resources. Obtain articles on-line via Welch Medical Library (and download pdf files). Page 35
68 Boolean queries (Fig. 2.12, Page 34, 8/04)
1 AND 2: lipocalin AND disease (60 results)
1 OR 2: lipocalin OR disease (1,650,000 results)
1 NOT 2: lipocalin NOT disease (530 results)
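The three boolean operators behave like set operations on the articles matching each term. The sketch below uses small made-up sets of article IDs (hypothetical values, not PubMed data) and a hypothetical helper `boolean_search` to show why AND narrows a search, OR broadens it, and NOT excludes.

```python
# Hypothetical article IDs matching each search term (illustrative only).
lipocalin = {1, 2, 3, 4}
disease = {3, 4, 5, 6, 7}

def boolean_search(a, b, op):
    """Combine two result sets the way AND, OR, and NOT combine queries."""
    if op == "AND":
        return a & b   # articles matching both terms
    if op == "OR":
        return a | b   # articles matching either term
    if op == "NOT":
        return a - b   # articles matching the first term but not the second
    raise ValueError(f"unknown operator: {op}")

if __name__ == "__main__":
    for op in ("AND", "OR", "NOT"):
        hits = boolean_search(lipocalin, disease, op)
        print(f"lipocalin {op} disease: {len(hits)} results")
```

With these toy sets the AND result is the smallest and the OR result the largest, the same ordering seen in the counts on the slide.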
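69 Search result vs. article contents for “globin” (8/06)
“globin” is found, “globin” is present in the article: true positive
“globin” is found, “globin” is absent from the article: false positive (article does not discuss globins)
“globin” is not found, “globin” is present in the article: false negative (article discusses globins)
“globin” is not found, “globin” is absent from the article: true negative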
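The four outcomes above can be written as a small decision function. This is a minimal sketch; the function name `classify_hit` and the sample articles are hypothetical and only illustrate the table.

```python
from collections import Counter

def classify_hit(discusses_globins, term_found):
    """Label one article according to the 2x2 table above."""
    if term_found:
        return "true positive" if discusses_globins else "false positive"
    return "false negative" if discusses_globins else "true negative"

if __name__ == "__main__":
    # Hypothetical articles: (article discusses globins, search found "globin")
    articles = [(True, True), (False, True), (True, False), (False, False)]
    counts = Counter(classify_hit(d, f) for d, f in articles)
    for label, n in counts.items():
        print(label, n)
```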
70 WelchWeb is available at http://www.welch.jhu.edu
71 Brian Brown (email@example.com) and Carrie Iwema are the Welch Medical Library liaisons to the basic sciences