Using Entrez: The Life Sciences Search Engine. Presentation transcript:
Searching NCBI Databases Efficiently Knowing how to retrieve exactly the information you need, efficiently, is the most fundamental skill in bioinformatics. Every NCBI database is designed for a specific purpose. A common mistake bioinformatics novices make is searching for information in an inappropriate database. Entrez links records among and within databases, making it easier to search for information.
What is Entrez? Entrez is an NCBI retrieval system designed for searching several linked databases. Entrez is a search tool for integrated access to the biological literature and sequence data. Entrez is extremely powerful, enabling the user to quickly move between the different specialized databases.
Entrez Entrez is divided into sites for nucleotide, protein, structure, genomes, OMIM, and more. You can use limits (such as RefSeq) to focus your Entrez search. When you conduct a search via Entrez, your query generates this screen, telling you the number of hits to your query.
The Big Picture [Diagram of Entrez and the resources it connects: LocusLink, Nucleotide, Protein, OMIM, PubMed, SNP, MGC, UCSC, GDB, e!, HGMD, UniGene, HomoloGene, MapViewer, Structure, 3D Domains, CDD, Books, PopSet, Genome, Taxonomy, ProbeSet, UniSTS]
Entrez and LocusLink Entrez doesn’t link to all the databases that contain sequences, however! LocusLink has its own groups of links to specialty databases, since it doesn’t cover all the genomes yet.
Entrez: Database Integration [Diagram: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure, Genomes, Taxonomy; connections labeled Word weight, VAST, BLAST, Phylogeny]
The (ever) Expanding Entrez System [Diagram: Entrez linking Journals, UniGene, PubMed, Nucleotide, Protein, SNP, Genome, Books, ProbeSet, OMIM, CDD, Taxonomy, 3D Domains, UniSTS, PopSet, Structure]
Entrez Databases
PubMed: Biomedical literature
Books: Online textbooks
Nucleotide: GenBank, EMBL, DDBJ, RefSeq, PDB
Protein: [GenBank, EMBL, DDBJ], RefSeq, SWISS-PROT, PIR, PRF, PDB
Genome: Complete genomes
Taxonomy: Organisms in NCBI sequence databases
Structure: MMDB (experimental 3D structures)
Domains: CDD (conserved protein domains)
3D Domains: Compact 3D protein domains in MMDB
OMIM: Online Mendelian Inheritance in Man
SNP: Single nucleotide polymorphisms
UniSTS: Sequence Tagged Site markers
ProbeSet: Gene expression and microarray datasets
PopSet: Population study datasets
UniGene: Gene-based expressed sequence clusters
Nucleotide Database The Nucleotide database contains sequence data from GenBank, EMBL, and DDBJ, the members of the tripartite, international collaboration of sequence databases. EMBL is the European Molecular Biology Laboratory at Hinxton Hall, UK; DDBJ is the DNA Database of Japan in Mishima, Japan. Sequence data are also incorporated from the Genome Sequence Data Base (GSDB), Santa Fe, NM. Patent sequences are incorporated through arrangements with the U.S. Patent and Trademark Office (USPTO) and via the collaborating international databases from other international patent offices.
Entrez Nucleotides
Primary: GenBank / EMBL / DDBJ 35,116,960
Derivative: RefSeq 259,219; Third Party Annotation 3,182; PDB 4,703
Total: 35,384,248
Database Searching with Entrez –Using limits and field restriction to find plant g6pdh –Linking and neighboring with g6pdh
Entrez Nucleotides glucose 6 phosphate dehydrogenase The G6PD enzyme catalyzes the oxidation of glucose-6-phosphate to 6-phosphogluconolactone (which is then hydrolyzed to 6-phosphogluconate), while reducing nicotinamide adenine dinucleotide phosphate (NADP+) to NADPH. In terms of electron transfer, glucose-6-phosphate loses two electrons and NADP+ gains two electrons to become NADPH. This is the first step in the pentose phosphate pathway. This pathway, or shunt, as it is sometimes called, produces the 5-carbon sugar ribose, which is needed to build the nucleotides of both DNA and RNA.
Limits Are Helpful Limits allow restriction of a search to a defined subset of the database. Limits can be set to restrict a search to a particular database field (e.g., the Author field). Limits can be set to search everything but a particular type of data (e.g., “exclude patent records”). Alternatively, limits can be set to search only a particular type of data (e.g., Genomic RNA/DNA) or to search only data from a particular source database (e.g., EMBL). Date limits and sequence length limits are also possible. The contents of each Entrez database differ, and therefore the Limits available for each database differ.
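The field restrictions described above can also be written directly into a query string. The following is a minimal sketch, not a description of how the Limits page works internally: it assumes the bracketed field-tag syntax (for example [Title] and [Organism]) and NCBI's public E-utilities ESearch endpoint with its db, term, and retmax parameters. The helper names build_term and build_esearch_url are hypothetical, and the code only builds the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint, the programmatic counterpart of the Entrez search box.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(phrase, title_only=True, organism=None):
    """Combine a phrase with field restrictions into one Entrez query string.

    Hypothetical helper: a bracketed tag restricts the preceding term to that field.
    """
    field = "[Title]" if title_only else "[All Fields]"
    parts = [f'"{phrase}"{field}']
    if organism:
        # Restrict hits to records from the named organism or taxonomic group.
        parts.append(f'"{organism}"[Organism]')
    return " AND ".join(parts)


def build_esearch_url(term, db="nuccore", retmax=20):
    """Return an ESearch URL; "nuccore" is the E-utilities name of the Nucleotide database."""
    return f"{ESEARCH_URL}?{urlencode({'db': db, 'term': term, 'retmax': retmax})}"


term = build_term("glucose 6 phosphate dehydrogenase", organism="green plants")
url = build_esearch_url(term)
print(term)
print(url)
```

Keeping the query builder separate from the URL builder makes it easy to reuse the same restricted query against another database by changing only the db argument.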
glucose 6 phosphate dehydrogenase Entrez Nucleotides: Limits & Preview/Index Try using the Limits and Preview/Index functions to hone your search and find the plant G6PD genes.
Entrez Nucleotides: Limits [Screenshot: query "glucose 6 phosphate dehydrogenase". Field Restriction menu: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume. Option: Exclude bulk sequences]
Entrez Nucleotides: Limits [Screenshot: query "glucose 6 phosphate dehydrogenase" with limits set: Title field (Title == Definition), Exclude Bulk Sequences, mRNA molecule type, Nuclear gene]
Adding Terms: Preview/Index [Screenshot: the term "green plants" added to the query through the Preview/Index field menu, which offers the same fields as the Limits menu]
Database Neighbors and Interlinking What makes Entrez more powerful than many services is that most of its records are linked to other records, both within a given database (such as Nucleotide) and between databases. Links within a database are called “neighbors” (e.g., Nucleotide neighbors).
Links Between Databases Protein and Nucleotide neighbors are determined by performing similarity searches using the BLAST algorithm to compare the entry amino acid or DNA sequence to all other amino acid or DNA sequences in the database. We will discuss more about BLAST later. Nucleotide sequence records in the Nucleotide database are linked to the PubMed citation of the article in which the sequences were published. Protein sequence records are linked to the nucleotide sequence from which the protein was translated.
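Links and neighbors can also be requested programmatically. The sketch below assumes NCBI's public E-utilities ELink endpoint and its dbfrom, db, id, and cmd parameters; the helper name build_elink_url and the record identifier are hypothetical placeholders, and which links actually exist depends on the record and the database pair. The code only constructs URLs and makes no network request.

```python
from urllib.parse import urlencode

# Public E-utilities ELink endpoint, which returns records linked to a given record.
ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def build_elink_url(uid, dbfrom, db):
    """Return an ELink URL asking for records in `db` linked to `uid` in `dbfrom`."""
    params = {"dbfrom": dbfrom, "db": db, "id": uid, "cmd": "neighbor"}
    return f"{ELINK_URL}?{urlencode(params)}"


EXAMPLE_UID = "123456"  # hypothetical identifier, for illustration only

# Same source and target database: neighbors (similar records within one database).
neighbors_url = build_elink_url(EXAMPLE_UID, "nuccore", "nuccore")
# Different target databases: links between databases.
protein_url = build_elink_url(EXAMPLE_UID, "nuccore", "protein")
pubmed_url = build_elink_url(EXAMPLE_UID, "nuccore", "pubmed")

for url in (neighbors_url, protein_url, pubmed_url):
    print(url)
```

Using the same function for both cases mirrors the distinction drawn above: a neighbor query keeps the database fixed, while a link query changes the target database.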
Plant cytosolic g6pdh mRNAs [Screenshot: display menu. Formats: Summary, Brief, GenBank, ASN.1, FASTA, GI list. Links and neighbors (related records): LinkOut, PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome Links, Taxonomy Links, OMIM Links]
LinkOut LinkOut is a feature of Entrez that is designed to provide users with links from PubMed and other Entrez databases to a wide variety of relevant web-accessible online resources: –Full-text publications –Other biological databases –Consumer health information –Research tools The goal is to facilitate access to relevant online resources beyond the Entrez system to extend, clarify, or supplement information found in the Entrez databases.
Protein Database The Protein database includes proteins from translated regions of DNA in GenBank as well as sequences from PIR. The entry includes: –The name of the protein –How the protein sequence was derived –An accession and a PID number –The number of amino acids
Protein Entry The entry also includes: –Structural information for the protein (if known): α-helices and β-sheets, domains, etc. –The sequence of amino acids comprising the protein
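The amino acid sequence and its length can be recovered from a record saved in FASTA format, one of the display formats Entrez offers. The sketch below is a simple stand-alone parser, not NCBI code; the function name parse_fasta is hypothetical and the record shown is a made-up sequence used only for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are joined because FASTA wraps long sequences.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record, for illustration only (not a real database entry).
EXAMPLE = """>example_protein hypothetical record
MKTAYIAKQR
QISFVKSHFS
"""

for header, sequence in parse_fasta(EXAMPLE):
    identifier = header.split()[0]
    print(identifier, len(sequence), "amino acids")
```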
Setting Protein Database search limits Choose Protein from the drop-down menu –Can do a Boolean search –Or can set Limits: Fields (e.g., Author, Journal, etc.); Gene Location (genomic, mitochondrial, etc.); Segmented Sequence; Only from (database to check); Modification date
Linking Between Databases Sometimes you will pull up a record and have no idea what organism the gene you are looking at comes from. For example, in the following record, what is Medicago sativa?
Taxonomy to the Rescue Entrez lets you click a live link from the record and determine what organism Medicago sativa is. It is alfalfa. You can also tell what it is related to taxonomically, because sometimes the common name isn’t very useful either!
What is BLink? BLink (BLAST Link): someone has already done a BLAST search, and you can just retrieve it! BLink displays the graphical output of precomputed blastp results against the protein non-redundant (nr) database.
This graphical output includes: –Alignment of up to 200 BLAST hits on the query sequence –Best hits to each organism –List of known protein domains in the query sequence –Filter hits by selecting the BLAST cutoff score –Distribution of hits by taxonomic grouping –Display of similar sequences with known 3D structure –Filter hits by database and/or by taxonomic grouping –Display a taxonomic tree of all organisms with similar sequences
PopSet Links The PopSet database contains aligned sequences submitted as a set resulting from a population, phylogenetic, or mutation study. These alignments describe such events as evolution and population variation. The PopSet database contains both nucleotide and protein sequence data.
Protein structures can also be in databases. http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html is a useful review tutorial.
Entrez links to structure databases The Structure database or Molecular Modeling Database (MMDB) contains experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB). The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy. Use Cn3D, the NCBI 3D structure viewer, for easy interactive visualization of molecular structures from Entrez.
Structure Search results Protein structures are also in a database. –Search as before –Your search results are similar
Structure Entry The Structure entry has links to the other databases, and it allows you to download a file to open with a structure viewer program.
Proteins with similar structures and functions have been identified in the databases
MMDB: Molecular Modeling Data Base Derived from experimentally determined PDB records. Value added to PDB records, including: –Addition of explicit chemical graph information –Validation –Inclusion of Taxonomy, Citation, and other information –Conversion to ASN.1 data description language Structure neighbors are determined by the Vector Alignment Search Tool (VAST).
Other services and databases from the NCBI LocusLink links to all available information from NCBI and beyond for a few well-characterized model organisms. LocusLink is a great starting point: it collects key information on each gene/protein from major databases. It now covers 8 organisms. RefSeq provides a curated reference accession number for each mRNA (NM_006744) or protein (NP_007635).
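RefSeq accession numbers can be recognized by their two-letter prefix. The sketch below classifies only the prefixes that appear in this presentation (NM_, NP_, XP_); the function name describe_refseq and the regular expression are assumptions made for illustration, not an NCBI specification, and other RefSeq prefixes exist that this table does not list.

```python
import re

# Prefixes mentioned in this presentation; anything else is reported as "unknown".
REFSEQ_PREFIXES = {
    "NM_": "nucleotide (mRNA) record",
    "NP_": "protein, mRNA based",
    "XP_": "protein, genome based",
}

# Assumed pattern: two capital letters, underscore, digits, optional ".version".
ACCESSION_RE = re.compile(r"^([A-Z]{2}_)(\d+)(?:\.(\d+))?$")


def describe_refseq(accession):
    """Split a RefSeq-style accession into prefix, number, and version.

    Returns None when the string does not look like a RefSeq accession.
    """
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    return {
        "prefix": prefix,
        "number": number,
        "version": int(version) if version else None,
        "kind": REFSEQ_PREFIXES.get(prefix, "unknown"),
    }


for acc in ("NM_006744", "NP_007635"):
    print(acc, "->", describe_refseq(acc)["kind"])
```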
LocusLink Results The results of a LocusLink search include: –Locus ID –Species –Locus symbol –Locus name –Locus location –Links: Protein Database, OMIM, Reference Sequence, Related GenBank Sequences, HomoloGene Data, UniGene, Variation Data
LocusLink: Selected Higher Genomes [Diagram: LocusLink full report linked to OMIM, RefSeq, GenBank, dbSNP, UniGene, PubMed, HomoloGene, Map Viewer, Protein]
Protein Database The Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL, and DDBJ as well as protein sequences submitted to: –Protein Information Resource (PIR) –SWISS-PROT –Protein Research Foundation (PRF) –Protein Data Bank (PDB) (sequences from solved structures)
NCBI Protein Databases
GenPept: GenBank, EMBL, DDBJ CDS translations
RefSeq: mRNA based (NP_) and genome based (XP_)
Swiss-Prot: curated, high-quality protein reviews
PIR: Protein Information Resource, Georgetown University
PRF: Protein Research Foundation
PDB: Protein Data Bank, sequences from structures
Entrez Protein
GenPept (GB, EMBL, DDBJ): 3,442,298
RefSeq: 856,191
Third Party Annotation: 3,834
Swiss-Prot: 144,508
PIR: 282,821
PRF: 12,079
Total: 3,442,298
BLAST nr: 1,642,191
But wait! There’s more! There is even more at NCBI than I have covered here. This site map is also a guide to NCBI resources. Each link leads to a brief description of the resource on this page, then to the resource itself. http://www.ncbi.nlm.nih.gov/Sitemap/
There are many bioinformatics servers outside NCBI. Try ExPASy’s sequence retrieval system at http://www.expasy.ch/ (ExPASy = Expert Protein Analysis System) Or try ENSEMBL at www.ensembl.org for a premier human genome web browser.