Presentation on theme: "The Life Sciences Search Engine"— Presentation transcript:
1 The Life Sciences Search Engine: Using Entrez
2 Searching NCBI Databases Efficiently Knowing how to retrieve the exact information you need efficiently is the fundamental and most important skill in bioinformatics. Every NCBI database is designed and created for a specific purpose. A common mistake bioinformatics novices make is searching for information in an inappropriate database. Entrez links among and within databases, making it easier to search for information.
3 What is Entrez? Entrez is an NCBI retrieval system designed for searching several linked databases. Entrez is a search tool for integrated access to the biological literature and sequence data. Entrez is extremely powerful, enabling the user to quickly move between the different specialized databases.
4 Entrez Entrez is divided into sites for nucleotide, protein, structure, genomes, OMIM, and more. You can use limits (such as RefSeq) to focus your Entrez search. When you conduct a search via Entrez, your query generates this screen, telling you the number of hits to your query.
9 The (ever) Expanding Entrez System [Diagram of the Entrez system and its databases: PubMed, Nucleotide, UniGene, Protein, Journals, Structure, CDD, Genome, PopSet, SNP, OMIM, 3D Domains, Taxonomy, UniSTS, ProbeSet, Books]
10 Entrez Databases
  PubMed: Biomedical literature
  Books: Online textbooks
  Nucleotide: GenBank, EMBL, DDBJ, RefSeq, PDB
  Protein: [GenBank, EMBL, DDBJ], RefSeq, SWISS-PROT, PIR, PRF, PDB
  Genome: Complete genomes
  Taxonomy: Organisms in NCBI sequence databases
  Structure: MMDB, experimental 3D structures
  Domains: CDD, conserved protein domains
  3D Domains: Compact 3D protein domains in MMDB
  OMIM: Online Mendelian Inheritance in Man
  SNP: Single nucleotide polymorphisms
  UniSTS: Sequence Tagged Site markers
  ProbeSet: Gene expression and microarray datasets
  PopSet: Population study datasets
  UniGene: Gene-based expressed sequence clusters
11 Nucleotide Database The Nucleotide database contains sequence data from GenBank, EMBL, and DDBJ, the members of the tripartite, international collaboration of sequence databases. EMBL is the European Molecular Biology Laboratory at Hinxton Hall, UK; DDBJ is the DNA Database of Japan in Mishima, Japan. Sequence data are also incorporated from the Genome Sequence Data Base (GSDB), Santa Fe, NM. Patent sequences are incorporated through arrangements with the U.S. Patent and Trademark Office (USPTO) and via the collaborating international databases from other international patent offices.
12 Entrez Nucleotides
  Primary: GenBank / EMBL / DDBJ 35,116,960
  Derivative: RefSeq ,219
  Third Party Annotation ,182
  PDB ,703
  Total ,384,248
These slides are obtained and/or modified from the NCBI website “Field Guide to NCBI” slides and from
13 Database Searching with Entrez Using limits and field restriction to find plant g6pdh. Linking and neighboring with g6pdh.
14 Entrez Nucleotides: glucose 6 phosphate dehydrogenase The G6PD enzyme catalyzes the oxidation of glucose-6-phosphate to 6-phosphogluconolactone (which is then hydrolyzed to 6-phosphogluconate), while reducing nicotinamide adenine dinucleotide phosphate (NADP+ to NADPH). In terms of electron transfer, glucose-6-phosphate loses two electrons and NADP+ gains two electrons to become NADPH. This is the first step in the pentose phosphate pathway. This pathway, or shunt, as it is sometimes called, produces the 5-carbon sugar, ribose, which is an essential component of both DNA and RNA.
15 You will find MANY hits for this query and can’t easily tell which ones are from plants.
16 Limits Are Helpful Limits allow restriction of a search to a defined subset of the database. Limits can be set to restrict a search to a particular database field (e.g., the Author field). Limits can be set to search everything but a particular type of data (e.g., “exclude patent records”). Alternatively, limits can be set to search only a particular type of data (e.g., Genomic RNA/DNA) or to search only data from a particular source database (e.g., EMBL). Date limits and sequence length limits are also possible. The contents of each Entrez database differ, and therefore the Limits available for each database differ.
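The field restrictions and exclusions described on this slide can also be typed directly into the search box using the `term[Field]` form with the Boolean operators AND and NOT. The sketch below only assembles such a query string; the helper names `field_term` and `build_query` are made up for illustration, and the author name is a hypothetical placeholder, not a value from these slides.

```python
def field_term(value, field):
    """Restrict a search value to one field using the term[Field] form."""
    return f"{value}[{field}]"


def build_query(required, excluded=()):
    """AND together the required (value, field) pairs, then NOT out the excluded ones."""
    query = " AND ".join(field_term(value, field) for value, field in required)
    for value, field in excluded:
        query += f" NOT {field_term(value, field)}"
    return query


# Restrict the enzyme name to the Title field.
title_only = build_query([("glucose 6 phosphate dehydrogenase", "Title")])
print(title_only)

# Same search, excluding records by a (hypothetical) author.
without_author = build_query(
    [("glucose 6 phosphate dehydrogenase", "Title")],
    excluded=[("Smith J", "Author")],
)
print(without_author)
```

Keeping each limit as a (value, field) pair makes it easy to add or drop one restriction at a time and watch how the hit count changes.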
17 Entrez Nucleotides: Limits & Preview/Index Query: glucose 6 phosphate dehydrogenase. Try using the Limits and Preview/Index functions to hone your search and find the plant G6PD genes.
18 Entrez Nucleotides: Limits Field restriction options: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume. Query: glucose 6 phosphate dehydrogenase. Option: Exclude bulk sequences. You can select from a long list of field limits to search through the databases.
19 Entrez Nucleotides: Limits Query: glucose 6 phosphate dehydrogenase. Selections: Title (Title == Definition); Exclude Bulk Sequences; mRNA molecule type; Nuclear gene. You can limit the search by choosing among the fields of the flat-file records.
20 Document Summaries: Limits Now you have far fewer records than you started with, because of the limits you chose. However, you still have a lot of records to choose from.
21 Adding Terms: Preview/Index Field list: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume. Term added: green plants. You may not know all the terms that are in the database to select from. You can use Preview/Index to give you lists of searchable keywords. This could help you set more meaningful limits.
22 Plant cytosolic g6pdh mRNAs Using the green plants limit, you are now down to only one page of G6PD genes, and they are all from green plants.
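The same limited search can be expressed as a single query string and sent to NCBI programmatically. The slides only use the web interface, so the following is a sketch under assumptions: it uses the E-utilities `esearch.fcgi` endpoint with its `db`, `term`, and `retmax` parameters, treats `nuccore` as the Nucleotide database name, and only builds the URL without contacting NCBI. The function name `esearch_url` is made up for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address of the E-utilities search service (assumed, not from the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) a search URL for one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


term = "glucose 6 phosphate dehydrogenase[Title] AND green plants[Organism]"
url = esearch_url("nuccore", term)
print(url)

# Round-trip check: decoding the URL gives back the original query term.
decoded = parse_qs(urlsplit(url).query)
print(decoded["term"][0])
```

`urlencode` escapes the spaces and square brackets, so the field tags survive the trip into the URL unchanged.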
23 Database Neighbors and Interlinking What makes Entrez more powerful than many services is that most of its records are linked to other records, both within a given database (such as Nucleotide) and between databases. Links within a database are called “neighbors” (e.g., Nucleotide neighbors).
24 Links Between Databases Protein and Nucleotide neighbors are determined by performing similarity searches using the BLAST algorithm to compare the entry amino acid or DNA sequence to all other amino acid or DNA sequences in the database. We will discuss more about BLAST later. Nucleotide sequence records in the Nucleotide database are linked to the PubMed citation of the article in which the sequences were published. Protein sequence records are linked to the nucleotide sequence from which the protein was translated.
25 Plant cytosolic g6pdh mRNAs Formats: Summary, Brief, GenBank, ASN.1, FASTA, GI list. Links and neighbors (related records): LinkOut, PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome Links, Taxonomy Links, OMIM Links. Now you can look for other information that might be close to what you are studying using links and neighbors, which are related records. This helps you broaden your search a little, in case the records don’t include your search terms as keywords.
26 LinkOut LinkOut is a feature of Entrez that is designed to provide users with links from PubMed and other Entrez databases to a wide variety of relevant web-accessible online resources: full-text publications; other biological databases; consumer health information; research tools. The goal is to facilitate access to relevant online resources beyond the Entrez system to extend, clarify, or supplement information found in the Entrez databases.
27 Protein Database The Protein database includes proteins from translated regions of DNA in GenBank as well as sequences from PIR. The entry includes: the name of the protein; how the protein sequence was derived; an accession and a PID number; the number of amino acids.
28 Protein Entry The entry also includes: structural information for the protein, if known (helices and β-sheets, domains, etc.); the sequence of amino acids comprising the protein.
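Two of the items a protein entry reports, the sequence itself and the number of amino acids, are easy to recover from a record saved in FASTA format. This is a minimal sketch, assuming the usual FASTA layout (a `>` header line followed by sequence lines); the record shown is an invented placeholder, not a real database entry, and `read_fasta` is a name chosen here for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping each header line to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line    # sequence may span several lines
    return records


# Invented example record (not taken from any database).
example = """>example_protein hypothetical record
MKTAYIAKQR
QISFVKSHFS
"""

proteins = read_fasta(example)
for name, sequence in proteins.items():
    print(name, len(sequence), "amino acids")
```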
29 Setting Protein Database search limits Choose Protein from the drop-down menu. You can do a Boolean search, or you can set LIMITS: Fields (e.g., Author, Journal, etc.); Gene Location (genomic, mitochondrial, etc.); Segmented Sequence; Only from (database to check); Modification date.
30 Linking Between Databases Sometimes you will pull up a record and have no idea what organism the gene you are looking at is from. For example, in the following record, what is Medicago sativa?
32 Taxonomy to the Rescue Entrez lets you click a live link from the record and determine what organism Medicago sativa is. It is alfalfa. You can also tell what it is related to taxonomically, because sometimes the common name isn’t very useful either!
35 What is BLink? BLink (BLAST Link): someone has already done a BLAST search, and you can just retrieve it! BLink displays the graphical output of pre-computed blastp results against the protein non-redundant (nr) database.
36 This graphical output includes: alignment of up to 200 BLAST hits on the query sequence; best hits to each organism; list of known protein domains in the query sequence; filtering of hits by selecting the BLAST cutoff score; distribution of hits by taxonomic grouping; display of similar sequences with known 3D structure; filtering of hits by database and/or by taxonomic grouping; display of a taxonomic tree of all organisms with similar sequences.
37 PopSet Links The PopSet database contains aligned sequences submitted as a set resulting from a population, phylogenetic, or mutation study. These alignments describe such events as evolution and population variation. The PopSet database contains both nucleotide and protein sequence data.
44 Protein Structures can also be in databases This is an excellent review of this topic: [link]. [link] is a useful review. Tutorial: [link].
45 Entrez links to structure databases The Structure database or Molecular Modeling Database (MMDB) contains experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB). The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy. Use Cn3D, the NCBI 3D structure viewer, for easy interactive visualization of molecular structures from Entrez.
46 Structure Search results Protein structures are also in a database. Search as before. Your search results are similar.
47 Structure Entry The structure entry has links to the other databases, and it allows you to download a file to open with a structure viewer program.
48 Proteins with similar structures and functions have been identified in the databases
55 MMDB: Molecular Modeling Data Base Derived from experimentally determined PDB records. Value added to PDB records includes: addition of explicit chemical graph information; validation; inclusion of Taxonomy, Citation, and other information; conversion to the ASN.1 data description language; structure neighbors determined by the Vector Alignment Search Tool (VAST).
57 Cn3D 4.1 You can view the structure in the database in several different formats.
58 Cn3D 4.1: Structural Alignment Conserved ATP binding site. Two genes can be compared by both their sequences and their structures to see if they align into the same molecular shape. Often the structure and function of a gene product are more conserved than its exact primary amino acid sequence. Shown: Src kinase (H. sapiens) and casein kinase (S. pombe).
59 Cn3D: Simple Homology Modeling Here you can see the 3-D structure of two related genes, one from human and one from swordtail, a type of fish. These genes are red where the sequence matches and blue where it doesn’t. The 3-D structure can be highlighted to show the regions where the sequence isn’t the same. Labels: human, swordtail.
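The red/blue coloring on this slide comes down to a position-by-position comparison of two aligned sequences. The toy sketch below illustrates that idea only; it is not how Cn3D itself is implemented. It assumes the two sequences are already aligned to equal length with `-` marking gaps, treats gap positions as non-matching, and uses invented sequences and a made-up function name, `color_by_identity`.

```python
def color_by_identity(seq_a, seq_b):
    """Label each aligned position red (identical residues) or blue (different or gap)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    colors = []
    for a, b in zip(seq_a, seq_b):
        if a == b and a != "-":
            colors.append("red")
        else:
            colors.append("blue")
    return colors


# Invented aligned fragments standing in for the human and swordtail proteins.
human = "MKT-A"
swordtail = "MRT-A"
colors = color_by_identity(human, swordtail)
print(list(zip(human, swordtail, colors)))
```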
60 Using Cn3D to model domains Domains are functional parts of a protein. Here the same small functional part of a series of proteins has been lined up and the sequence is compared. Some regions match very closely and are in pink. Others are similar. The important serine is highlighted in yellow.
61 Other services and databases from the NCBI LocusLink links to all possible information from NCBI and beyond for a few well-characterized model organisms. LocusLink is a great starting point: it collects key information on each gene/protein from major databases. It now covers 8 organisms. RefSeq provides a curated, optimal accession number for each mRNA (NM_006744) or protein (NP_007635).
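RefSeq accession numbers carry their record type in the prefix, which makes them easy to sort programmatically. The sketch below covers only the prefixes that appear in these slides (NM_ and NP_ here, XP_ on the protein databases slide); NM_ accessions denote mRNA records. The table is deliberately incomplete, and `classify_refseq` is a name invented for this example.

```python
# Prefixes mentioned in the slides; RefSeq defines others not listed here.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein (predicted from genome annotation)",
}


def classify_refseq(accession):
    """Return the record type implied by a RefSeq accession prefix, or 'unknown'."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown")


for acc in ["NM_006744", "NP_007635"]:
    print(acc, "->", classify_refseq(acc))
```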
62 LocusLink Results of a LocusLink search include: Locus ID; Species; Locus symbol; Locus name; Locus location; Links; Protein Database; OMIM; Reference Sequence; Related GenBank Sequences; HomoloGene Data; UniGene; Variation Data. When you find that a gene is in LocusLink, you can easily compare it with other useful information. Not all genes have the same amount of information known.
64 Protein Database The Protein database contains sequence data from the translated coding regions of DNA sequences in GenBank, EMBL, and DDBJ, as well as protein sequences submitted to: Protein Information Resource (PIR); SWISS-PROT; Protein Research Foundation (PRF); Protein Data Bank (PDB) (sequences from solved structures).
65 NCBI Protein Databases
  GenPept: GenBank, EMBL, DDBJ CDS translations
  RefSeq: mRNA based (NP_) and genome based (XP_)
  Swiss-Prot: curated, high-quality protein reviews
  PIR: Protein Information Resource, Georgetown University
  PRF: Protein Research Foundation
  PDB: Protein Data Bank, sequences from structures
66 Entrez Protein
  GenPept (GB, EMBL, DDBJ) 3,442,298
  RefSeq 856,191
  Third Party Annotation ,834
  Swiss-Prot ,508
  PIR ,821
  PRF ,079
  Total ,442,298
  BLAST nr ,642,191
68 Related Proteins: Redundancy Redundant sequences: since the Protein database pulls in sequences from all these other databases, some of the entries can be considered redundant or duplicates. You can do some searches with the “nonredundant” database. Sometimes, however, you might want the additional information found in some of the specialized databases. SwissProt is an example: much more information has been added to its entries because they are more heavily curated.
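One simple way to picture a nonredundant set is to collapse entries whose sequences are exactly identical while remembering every source they came from. This is a toy sketch of that idea, not the procedure NCBI actually uses to build nr; the records, accession labels, and the function name `nonredundant` are all invented for illustration.

```python
def nonredundant(records):
    """Group (source, accession, sequence) records by identical sequence.

    Returns a dict mapping each distinct sequence to the list of
    "source:accession" labels that share it.
    """
    merged = {}
    for source, accession, sequence in records:
        key = sequence.upper()  # treat upper/lower case as the same residue
        merged.setdefault(key, []).append(f"{source}:{accession}")
    return merged


# Invented records: the first two carry the same sequence from different sources.
records = [
    ("GenPept", "EX0001", "MKTAYIAKQR"),
    ("SwissProt", "EX0002", "mktayiakqr"),
    ("PIR", "EX0003", "MSTNPKPQRK"),
]

unique = nonredundant(records)
for sequence, sources in unique.items():
    print(sequence, sources)
```

Keeping the source labels alongside each distinct sequence preserves the route back to the more heavily curated entry when its extra annotation is wanted.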
69 Related Proteins: Links Sequence from MutL structure
85 But wait! There’s more! There is even more at NCBI than I have covered here. This site map is also a guide to NCBI resources. Each link leads to a brief description of the resource on this page, then to the resource itself.
86 There are many bioinformatics servers outside NCBI. Try ExPASy’s sequence retrieval system at [link] (ExPASy = Expert Protein Analysis System). Or try ENSEMBL at [link] for a premier human genome web browser.