5 PubMed is… the National Library of Medicine's search service; 16 million citations in MEDLINE; links to participating online journals; PubMed tutorial (via “Education” on the side bar). Page 24
6 Entrez integrates… the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes. Page 24
7 Entrez is a search and retrieval system that integrates NCBI databases. Page 24
8 BLAST is… the Basic Local Alignment Search Tool; NCBI's sequence similarity search tool; supports analysis of DNA and protein databases; 100,000 searches per day. Page 25
9 OMIM is… Online Mendelian Inheritance in Man; a catalog of human genes and genetic disorders; edited by Dr. Victor McKusick and others at JHU. Page 25
10 Cancer Chromosomes: Contains cytogenetic, clinical, and reference information integrated from the NCI Mitelman Database of Chromosome Aberrations in Cancer, the NCI Recurrent Aberrations in Cancer database, and the NCI/NCBI SKY/M-FISH & CGH Database.
11 CDD: Conserved Domain Database, a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. Select 'Domains' from the Entrez pull-down menu.
12 CoreNucleotide: Contains all nucleotide sequences not included in the EST or GSS subsets. 3D Domains: Contains protein domains from the Entrez Structure database. EST: A Nucleotide database subset that contains only Expressed Sequence Tag records. Gene: Genes and associated information for a number of organisms, including human.
13 Genome: Genomes of over 1,200 organisms can be found in this database, representing both completely sequenced organisms and those for which sequencing is in progress. Genome Project: A searchable collection of complete and incomplete (in-progress) large-scale sequencing, assembly, annotation, and mapping projects for cellular organisms. dbGaP: Associated genotype and phenotype data. GENSAT: Gene expression atlas of the mouse central nervous system.
14 GEO Datasets: Curated gene expression and molecular abundance DataSets from NCBI's Gene Expression Omnibus, a gene expression and hybridization array repository. GEO Profiles: Individual gene expression and molecular abundance profiles assembled from the GEO repository.
15 Books is… a searchable resource of on-line books. Page 26
16 TaxBrowser is… a browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses); taxonomy information such as genetic codes; molecular data on extinct organisms. Page 26
17 Structure site includes… the Molecular Modelling Database (MMDB): biopolymer structures obtained from the Protein Data Bank (PDB); Cn3D (a 3D-structure viewer); the vector alignment search tool (VAST). Page 26
18 Accessing information on molecular sequences. Page 26
19 Accession numbers are labels for sequences. NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26
20 What is an accession number? An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4; the slide groups them as DNA, RNA, and protein):
    X          GenBank genomic DNA sequence
    NT_        Genomic contig
    Rs         dbSNP (single nucleotide polymorphism)
    N          An expressed sequence tag (1 of 170)
    NM_        RefSeq DNA sequence (from a transcript)
    NP_        RefSeq protein
    AAC02945   GenBank protein
    Q          SwissProt protein
    1KT7       Protein Data Bank structure record
Page 27
21 Four ways to access DNA and protein sequences: (1) Entrez Gene with RefSeq; (2) UniGene; (3) European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI); (4) ExPASy Sequence Retrieval System (separate from NCBI). Note: LocusLink at NCBI was recently retired. The third printing of the book has updated these sections (pages 27-31). Page 27
22 Four ways to access protein and DNA sequences: Entrez Gene with RefSeq. Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735). Page 27
23 From the NCBI home page, type “rbp4” and hit “Go” (revised Fig. 2.7).
36 NCBI’s important RefSeq project: best representative sequences. RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence. RefSeq identifiers include the following formats:
    Complete genome       NC_######
    Complete chromosome   NC_######
    Genomic contig        NT_######
    mRNA (DNA format)     NM_######   e.g. NM_006744
    Protein               NP_######   e.g. NP_006735
Pages 29-30
37 NCBI’s RefSeq project: accessions for genomic, mRNA, and protein sequences
    Accession   Molecule   Method            Note
    AC_         Genomic    Mixed             Alternate complete genomic
    AP_         Protein    Mixed             Protein products; alternate
    NC_         Genomic    Mixed             Complete genomic molecules
    NG_         Genomic    Mixed             Incomplete genomic regions
    NM_         mRNA       Mixed             Transcript products; mRNA
    NM_         mRNA       Mixed             Transcript products; 9-digit
    NP_         Protein    Mixed             Protein products
    NP_         Protein    Curation          Protein products; 9-digit
    NR_         RNA        Mixed             Non-coding transcripts
    NT_         Genomic    Automated         Genomic assemblies
    NW_         Genomic    Automated         Genomic assemblies
    NZ_ABCD     Genomic    Automated         Whole genome shotgun data
    XM_         mRNA       Automated         Transcript products
    XP_         Protein    Automated         Protein products
    XR_         RNA        Automated         Transcript products
    YP_         Protein    Auto. & Curated   Protein products
    ZP_         Protein    Automated         Protein products
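The prefix table lends itself to a simple lookup. The following is a minimal sketch, not an NCBI tool: the function name `refseq_molecule` is hypothetical, the dictionary only restates the Accession and Molecule columns above, and it assumes every RefSeq accession starts with two letters and an underscore.

```python
# Molecule type for each RefSeq prefix, restating the slide's table.
# Assumption: a RefSeq accession always begins with two letters plus "_".
REFSEQ_MOLECULE = {
    "AC_": "Genomic", "AP_": "Protein", "NC_": "Genomic", "NG_": "Genomic",
    "NM_": "mRNA",    "NP_": "Protein", "NR_": "RNA",     "NT_": "Genomic",
    "NW_": "Genomic", "NZ_": "Genomic", "XM_": "mRNA",    "XP_": "Protein",
    "XR_": "RNA",     "YP_": "Protein", "ZP_": "Protein",
}


def refseq_molecule(accession):
    """Return the molecule type for a RefSeq-style accession, or None.

    Hypothetical helper for illustration; it inspects only the first
    three characters and does not validate the digits that follow.
    """
    prefix = accession.strip()[:3].upper()
    return REFSEQ_MOLECULE.get(prefix)


if __name__ == "__main__":
    for acc in ["NM_006744", "NP_006735", "AAC02945"]:
        print(acc, "->", refseq_molecule(acc))
```

A non-RefSeq accession such as the GenBank protein AAC02945 has no matching prefix, so the lookup returns None rather than guessing.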
38 Four ways to access DNA and protein sequences: (1) Entrez Gene with RefSeq; (2) UniGene; (3) European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI); (4) ExPASy Sequence Retrieval System (separate from NCBI). Page 31
40 UniGene: unique genes via ESTs. Find UniGene at NCBI. UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library. UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution. Pages 20-21
41 Cluster sizes in UniGene: This is a gene with 1 EST associated; the cluster size is 1. Fig. 2.3, Page 23
42 Cluster sizes in UniGene: This is a gene with 10 ESTs associated; the cluster size is 10.
43 Cluster sizes in UniGene (human). UniGene build 194, 8/06
    Cluster size (ESTs)   Number of clusters
    1                     42,800
    2                     6,500
    …                     …
    16,000-30,000         8
44 UniGene: unique genes via ESTs. Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver). We will discuss UniGene further later (gene expression). Page 31
45 Four ways to access DNA and protein sequences: (1) Entrez Gene with RefSeq; (2) UniGene; (3) European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI); (4) ExPASy Sequence Retrieval System (separate from NCBI). Page 31
46 Ensembl to access protein and DNA sequences. Try Ensembl for a premier human genome web browser. We will encounter Ensembl as we study the human genome, BLAST, and other topics.
50 Four ways to access DNA and protein sequences: (1) Entrez Gene with RefSeq; (2) UniGene; (3) European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI); (4) ExPASy Sequence Retrieval System (separate from NCBI). Page 33
51 ExPASy to access protein and DNA sequences. ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System). Visit the ExPASy site. Page 33
56 Searching for HIV-1 pol: following the “genome” link yields a manageable three results. Page 34
57 Example of how to access sequence data: HIV-1 pol. For the Entrez query: hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for “hiv-1”), but these can easily be reduced in two easy steps: (1) specify the organism, e.g. hiv-1[organism]; (2) limit the output to RefSeq! Page 34
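The two narrowing steps can also be expressed as a query string. This is a minimal sketch, assuming access through the NCBI E-utilities ESearch endpoint instead of the Entrez web form the slide describes. The helper name `build_esearch_url`, the choice of the `nuccore` database, and the use of `srcdb_refseq[PROP]` as the RefSeq limit are assumptions for illustration. The code only builds a URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# E-utilities search endpoint (assumed here in place of the web form).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, organism=None, refseq_only=False, db="nuccore"):
    """Build an ESearch URL for `term`, optionally narrowed in two steps.

    Hypothetical helper: step 1 adds an [organism] field tag, step 2
    adds a RefSeq-only property filter. Terms are joined with AND.
    """
    parts = [term]
    if organism:
        parts.append(f"{organism}[organism]")   # step 1: specify the organism
    if refseq_only:
        parts.append("srcdb_refseq[PROP]")      # step 2: limit to RefSeq
    query = " AND ".join(parts)
    return ESEARCH + "?" + urlencode({"db": db, "term": query})


if __name__ == "__main__":
    print(build_esearch_url("hiv-1 pol"))
    print(build_esearch_url("pol", organism="hiv-1", refseq_only=True))
```

Printing both URLs side by side shows how each step adds one AND-ed restriction to the original broad query.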
59 Examples of how to access sequence data: histone
    Query for “histone”        # results
    protein records            …
    RefSeq entries             …
    RefSeq (limit to human)    1108
    NOT deacetylase            697
At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.
63 PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has >14 million records dating back to 1966. Page 35
64 MeSH is the acronym for "Medical Subject Headings." MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM. MeSH vocabulary is used for indexing journal articles for MEDLINE. The MeSH controlled vocabulary imposes uniformity and consistency on the indexing of biomedical literature. Page 35
67 PubMed search strategies: Try the tutorial (“Education” on the left sidebar). Use Boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease. Try using “Limits”. Try “Links” to find Entrez information and external resources. Obtain articles on-line via Welch Medical Library (and download pdf files). Page 35
68 Boolean queries (Fig. 2.12, Page 34; 8/04)
    1 AND 2   lipocalin AND disease   (60 results)
    1 OR 2    lipocalin OR disease    (1,650,000 results)
    1 NOT 2   lipocalin NOT disease   (530 results)
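The three operators behave like set intersection, union, and difference. A minimal sketch follows; the article IDs are made up for illustration and are not real PubMed identifiers or the result counts from the slide.

```python
# Hypothetical article IDs matching each search term (not real PMIDs).
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

# AND keeps only articles matching both terms (fewest results).
both = lipocalin & disease
# OR keeps articles matching either term (most results).
either = lipocalin | disease
# NOT keeps lipocalin articles that do not also match disease.
lipocalin_only = lipocalin - disease

if __name__ == "__main__":
    print("lipocalin AND disease:", sorted(both))
    print("lipocalin OR disease: ", sorted(either))
    print("lipocalin NOT disease:", sorted(lipocalin_only))
```

As in the slide's counts, AND narrows a search, OR broadens it, and NOT removes the overlap from the first term's results.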
69 Search results versus article contents for the query “globin” (8/06)
    Search result             Article contents: “globin” is present        Article contents: “globin” is absent
    “globin” is found         true positive                                false positive (article does not discuss globins)
    “globin” is not found     false negative (article discusses globins)   true negative
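The four cells of this table can be written as a small decision function. This is a minimal sketch with a hypothetical function name, `classify_search_outcome`; the two arguments correspond to the table's rows (search result) and columns (article contents).

```python
def classify_search_outcome(found_by_search, discusses_topic):
    """Label one article as in the search-result table.

    found_by_search: True if the query (e.g. "globin") retrieved the article.
    discusses_topic: True if the article actually discusses the topic.
    Hypothetical helper for illustration only.
    """
    if found_by_search and discusses_topic:
        return "true positive"
    if found_by_search and not discusses_topic:
        return "false positive"   # retrieved, but not about globins
    if not found_by_search and discusses_topic:
        return "false negative"   # about globins, but missed by the search
    return "true negative"


if __name__ == "__main__":
    for found in (True, False):
        for relevant in (True, False):
            print(found, relevant, "->", classify_search_outcome(found, relevant))
```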
70 WelchWeb is available at http://www.welch.jhu.edu
71 Brian Brown (email@example.com) and Carrie Iwema are the Welch Medical Library liaisons to the basic sciences.