The Life Sciences Search Engine

Using Entrez: The Life Sciences Search Engine

Searching NCBI Databases Efficiently
Retrieving exactly the information you need, efficiently, is the most fundamental skill in bioinformatics. Every NCBI database is designed for a specific purpose, and a common mistake novices make is searching for information in an inappropriate database. Entrez links records among and within databases, making it easier to find the information you need.

What is Entrez?
Entrez is an NCBI retrieval system for searching several linked databases. It provides integrated access to the biological literature and to sequence data, and it lets the user move quickly between the different specialized databases.

Entrez
Entrez is divided into sections for nucleotide, protein, structure, genomes, OMIM, and more. You can use limits (such as RefSeq) to focus your Entrez search. When you conduct a search via Entrez, the results screen tells you the number of hits to your query.

The Entrez System

The Big Picture
[Figure: resources connected through Entrez and LocusLink: Books, UCSC, PopSet, PubMed, e!, GDB, ProbeSet, Nucleotide, MGC, Genome, Protein, HGMD, Taxonomy, OMIM, Structure, Homologene, SNP, CDD, 3D Domains, UniSTS, MapViewer, UniGene]

Entrez and LocusLink
Entrez doesn’t link to all the databases that contain sequences, however! LocusLink has its own groups of links to specialty databases, since it doesn’t cover all the genomes yet.

Entrez: Database Integration
[Figure: Entrez databases and how their records are related: PubMed abstracts (word weight), Taxonomy (phylogeny), 3-D Structure (VAST), Nucleotide sequences (BLAST), Protein sequences (BLAST), Genomes]

The (ever) Expanding Entrez System
[Figure: databases in the Entrez system: PubMed, Nucleotide, UniGene, Protein, Journals, Structure, CDD, Genome, PopSet, SNP, OMIM, 3D Domains, Taxonomy, UniSTS, ProbeSet, Books]

Entrez Databases
PubMed: Biomedical literature
Books: Online textbooks
Nucleotide: GenBank, EMBL, DDBJ, RefSeq, PDB
Protein: [GenBank, EMBL, DDBJ], RefSeq, SWISS-PROT, PIR, PRF, PDB
Genome: Complete genomes
Taxonomy: Organisms in NCBI sequence databases
Structure: MMDB, experimental 3D structures
Domains: CDD, conserved protein domains
3D Domains: Compact 3D protein domains in MMDB
OMIM: Online Mendelian Inheritance in Man
SNP: Single nucleotide polymorphisms
UniSTS: Sequence Tagged Site markers
ProbeSet: Gene expression and microarray datasets
PopSet: Population study datasets
UniGene: Gene-based expressed sequence clusters

Nucleotide Database
The Nucleotide database contains sequence data from GenBank, EMBL, and DDBJ, the members of the tripartite, international collaboration of sequence databases. EMBL is the European Molecular Biology Laboratory at Hinxton Hall, UK; DDBJ is the DNA Database of Japan in Mishima, Japan. Sequence data are also incorporated from the Genome Sequence Data Base (GSDB), Santa Fe, NM. Patent sequences are incorporated through arrangements with the U.S. Patent and Trademark Office (USPTO) and via the collaborating international databases from other international patent offices.

Entrez Nucleotides
Primary: GenBank / EMBL / DDBJ 35,116,960
Derivative: RefSeq ,219
Derivative: Third Party Annotation ,182
Derivative: PDB ,703
Total ,384,248
These slides are obtained and/or modified from the NCBI website “Field Guide to NCBI” slides and from

Database Searching with Entrez
Using limits and field restriction to find plant g6pdh
Linking and neighboring with g6pdh

glucose 6 phosphate dehydrogenase
Entrez Nucleotides query: glucose 6 phosphate dehydrogenase
The G6PD enzyme catalyzes the oxidation of glucose-6-phosphate to 6-phosphogluconolactone (which is then hydrolyzed to 6-phosphogluconate), while reducing nicotinamide adenine dinucleotide phosphate (NADP+ to NADPH). In terms of electron transfer, glucose-6-phosphate loses two electrons and NADP+ gains two electrons to become NADPH. This is the first step in the pentose phosphate pathway. This pathway, or shunt, as it is sometimes called, produces the 5-carbon sugar ribose (as ribose 5-phosphate), which is an essential component of both DNA and RNA.
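Written out as a reaction, the step described above can be sketched as follows. This is the standard textbook form rather than something taken from the slides: the immediate product is the lactone, which is subsequently hydrolyzed to 6-phosphogluconate. The \text command assumes the amsmath package.

```latex
\[
\text{glucose 6-phosphate} + \mathrm{NADP^{+}}
\;\longrightarrow\;
\text{6-phosphoglucono-}\delta\text{-lactone} + \mathrm{NADPH} + \mathrm{H^{+}}
\]
```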

You will find MANY hits for this query, and you cannot easily tell which ones are from plants.

Limits Are Helpful
Limits allow restriction of a search to a defined subset of the database. Limits can be set to restrict a search to a particular database field (e.g., the Author field). Limits can be set to search everything but a particular type of data (e.g., “exclude patent records”). Alternatively, limits can be set to search only a particular type of data (e.g., Genomic RNA/DNA) or to search only data from a particular source database (e.g., EMBL). Date limits and sequence length limits are also possible. The contents of each Entrez database differ, and therefore the Limits available for each database differ.
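The same kinds of restriction can also be typed directly into the query box as field tags in square brackets. The sketch below only assembles such a query string; it does not contact NCBI. The helper names are hypothetical, and the specific values biomol_mrna and gbdiv_pat are assumptions about the Properties index that should be checked against the current Entrez help before use.

```python
def fielded(term: str, field: str) -> str:
    """Attach an Entrez field tag, quoting multi-word terms."""
    text = f'"{term}"' if " " in term else term
    return f"{text}[{field}]"


def build_query(required, excluded=()):
    """AND together the required clauses, then NOT each excluded clause."""
    query = " AND ".join(required)
    for clause in excluded:
        query = f"({query}) NOT {clause}"
    return query


query = build_query(
    required=[
        fielded("glucose 6 phosphate dehydrogenase", "Title"),
        fielded("biomol_mrna", "Properties"),   # assumed value: mRNA records only
    ],
    excluded=[fielded("gbdiv_pat", "Properties")],  # assumed value: patent records
)
print(query)
```

Wrapping the required part in parentheses before each NOT keeps the exclusion applied to the whole search rather than to the last clause only.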

Entrez Nucleotides: Limits & Preview/Index
Query: glucose 6 phosphate dehydrogenase
Try using the Limits and Preview/Index functions to hone your search and find the plant G6PD genes.

Entrez Nucleotides: Limits
Query: glucose 6 phosphate dehydrogenase
Field Restriction: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
Exclude bulk sequences
You can select from a long list of field limits to search through the databases.

Entrez Nucleotides: Limits
Query: glucose 6 phosphate dehydrogenase
Field: Title (the Definition line of the record)
Exclude bulk sequences
Molecule type: mRNA
Gene location: nuclear
You can limit the search by choosing among the fields of the flat-file records.

Document Summaries: Limits
Because of the limits you chose, you now have far fewer records than you started with. However, there are still many to sort through.

Adding Terms: Preview/Index
Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
Term added: green plants
You may not know all the terms in the database that you can select from. Preview/Index gives you lists of searchable terms for each field, which can help you set more meaningful limits.

Plant cytosolic g6pdh mRNAs
With the green plants limit applied, you are down to a single page of G6PD records, and they are all from green plants.
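The slides use the Entrez web forms, but the same restricted search can be expressed as a URL for E-utilities, NCBI's programmatic interface to Entrez. The sketch below only builds the URL and sends nothing. The function name is hypothetical, and the exact field names accepted for a given database are an assumption worth confirming in the E-utilities documentation.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL; urlencode escapes quotes, spaces and brackets."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


# Title restriction plus the "green plants" organism limit from the slides.
plant_term = (
    '"glucose 6 phosphate dehydrogenase"[Title] '
    'AND "green plants"[Organism]'
)
plant_url = esearch_url("nucleotide", plant_term)
print(plant_url)
```

Fetching the URL returns a list of record identifiers that match the query, which is the programmatic counterpart of the one-page result list described above.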

Database Neighbors and Interlinking
What makes Entrez more powerful than many services is that most of its records are linked to other records, both within a given database (such as Nucleotide) and between databases. Links within a database are called “neighbors” (e.g., Nucleotide neighbors).

Links Between Databases
Protein and Nucleotide neighbors are determined by performing similarity searches using the BLAST algorithm to compare the entry amino acid or DNA sequence to all other amino acid or DNA sequences in the database. We will discuss more about BLAST later. Nucleotide sequence records in the Nucleotide database are linked to the PubMed citation of the article in which the sequences were published. Protein sequence records are linked to the nucleotide sequence from which the protein was translated.
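Links between databases can also be followed programmatically through the E-utilities ELink service. The sketch below builds, without sending, the URLs that would ask for the PubMed citations and the protein records linked to one nucleotide record. The function name is hypothetical and the identifier 12345 is a placeholder, not a real record of interest.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(dbfrom: str, db: str, uid: str) -> str:
    """Build an ELink URL asking for records in `db` linked to `uid` in `dbfrom`."""
    params = {"dbfrom": dbfrom, "db": db, "id": uid}
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"


placeholder_uid = "12345"  # placeholder identifier, for illustration only
to_pubmed = elink_url("nucleotide", "pubmed", placeholder_uid)
to_protein = elink_url("nucleotide", "protein", placeholder_uid)
print(to_pubmed)
print(to_protein)
```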

Plant cytosolic g6pdh mRNAs
Formats: Summary, Brief, GenBank, ASN.1, FASTA, GI list
Links and neighbors (related records): LinkOut, PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome Links, Taxonomy Links, OMIM Links
Now you can look for other information that might be close to what you are studying by using links and neighbors, which are related records. This helps you broaden your search a little, in case the records don’t include your search terms as keywords.

LinkOut
LinkOut is a feature of Entrez that is designed to provide users with links from PubMed and other Entrez databases to a wide variety of relevant web-accessible online resources:
Full-text publications
Other biological databases
Consumer health information
Research tools
The goal is to facilitate access to relevant online resources beyond the Entrez system to extend, clarify, or supplement information found in the Entrez databases.

Protein Database
The Protein database includes proteins from translated regions of DNA in GenBank as well as sequences from PIR.
The entry includes:
The name of the protein
How the protein sequence was derived
An accession and a PID number
The number of amino acids

Protein Entry
The entry also includes:
Structural information for the protein (if known): helices and β-sheets, domains, etc.
The sequence of amino acids comprising the protein

Setting Protein Database search limits
Choose Protein from the drop-down menu
You can do a Boolean search
Or you can set LIMITS:
Fields (e.g., Author, Journal, etc.)
Gene Location (genomic, mitochondrial, etc.)
Segmented Sequence
Only from (database to check)
Modification date

Linking Between Databases
Sometimes you will pull up a record and have no idea what organism the gene comes from. For example, in the following record, what is Medicago sativa?

Entrez GenBank / GenPept

Taxonomy to the Rescue
Entrez lets you click a live link from the record and determine what organism Medicago sativa is. It is alfalfa. You can also tell what it is related to taxonomically, because sometimes the common name isn’t very useful either!

Taxonomy Link

Advanced Neighbors: BLink

What is BLink?
BLink = BLAST Link
Someone has done a BLAST search already, and you can just retrieve it! BLink displays the graphical output of pre-computed blastp results against the protein non-redundant (nr) database.

This graphical output includes:
Alignment of up to 200 BLAST hits on the query sequence
Best Hits to each organism
List of known protein domains in the query sequence
Filter hits by selecting the BLAST cutoff score
Distribution of hits by taxonomic grouping
Display of similar sequences with known 3D structure
Filter hits by database and/or by taxonomic grouping
Display a taxonomic tree of all organisms with similar sequences
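Two of the views listed above, filtering by a score cutoff and showing the best hit for each organism, amount to simple operations on a list of hits. The sketch below illustrates them on made-up data; the accessions and scores are invented for the example, and this is not how BLink itself is implemented.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    accession: str  # invented identifiers, for illustration only
    organism: str
    score: float


def filter_by_score(hits, cutoff):
    """Keep only hits whose score reaches the cutoff."""
    return [h for h in hits if h.score >= cutoff]


def best_hit_per_organism(hits):
    """Return the highest-scoring hit for each organism."""
    best = {}
    for h in hits:
        if h.organism not in best or h.score > best[h.organism].score:
            best[h.organism] = h
    return best


hits = [
    Hit("P_0001", "Medicago sativa", 950.0),
    Hit("P_0002", "Medicago sativa", 410.0),
    Hit("P_0003", "Arabidopsis thaliana", 870.0),
    Hit("P_0004", "Homo sapiens", 120.0),
]
kept = filter_by_score(hits, cutoff=200.0)
best = best_hit_per_organism(kept)
for organism, hit in sorted(best.items()):
    print(organism, hit.accession, hit.score)
```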

PopSet Links
The PopSet database contains aligned sequences submitted as a set resulting from a population, phylogenetic, or mutation study. These alignments describe such events as evolution and population variation. The PopSet database contains both nucleotide and protein sequence data.

Protein Neighbors->PopSet Links

Protein Neighbors->Genome Links

PopSet search results
The results of a PopSet search
The PopSet database includes alignments of genes from multiple organisms OR different gene families OR mutational analyses

PopSet Entry
The PopSet entry includes:
The title of the paper/study
The length of the sequence(s) aligned
The number of aligned sequences

PopSet Entry without alignment
The PopSet entry without an alignment includes:
Title of the study
The number of sequences included
Links to the sequences

Entrez Structures

Protein Structures can also be in databases
Linked resources: an excellent review of this topic, a useful review, and a tutorial.

Entrez links to structure databases
The Structure database or Molecular Modeling Database (MMDB) contains experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB). The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy. Use Cn3D, the NCBI 3D structure viewer, for easy interactive visualization of molecular structures from Entrez.

Structure Search results
The structures of proteins are also in a database
Search as before
Your search results are similar

Structure Entry
The structure entry has links to the other databases
It also lets you download a file to open with a structure viewer program

Proteins with similar structures and functions have been identified in the databases

BLink: Advanced Protein Neighbors

BLink: Related Structures
Related biological functions usually have related structures.

Viewing Structure in Cn3D
You can download Cn3D (a structure viewer program) from NCBI. This will allow you to view the structures from the Structure database.

Cn3D Text Window
The Text window of Cn3D will align two or more proteins so you can compare the structures of multiple proteins.

BLink: Human Homologue
Often in biomedical sciences, we want to find the human gene rather than just the mouse, dog, honeybee or rabbit gene. Selecting the human homologue can do this.

Human RefSeqs: Genome Reagents

MMDB: Molecular Modeling Data Base
Derived from experimentally determined PDB records
Value added to PDB records, including:
Addition of explicit chemical graph information
Validation
Inclusion of Taxonomy, Citation, and other information
Conversion to ASN.1 data description language
Structure neighbors determined by Vector Alignment Search Tool (VAST)

Structure Summary Cn3D viewer Structure Neighbors 3D Domain Neighbors
Conserved Domains

Cn3D 4.1
You can view the structure in the database in several different formats.

Cn3D 4.1: Structural Alignment
Conserved ATP binding site
Two proteins can be compared by both their sequences and their structures to see whether they align into the same molecular shape. Often the structure and function of a protein are more conserved than its exact primary amino acid sequence.
Structures shown: Src kinase (H. sapiens), Casein kinase (S. pombe)

Cn3D: Simple Homology Modeling
Here you can see the 3-D structures of two related proteins, one from human and one from swordtail, a type of fish. The structures are red where the sequences match and blue where they don’t, so the regions where the sequences differ are highlighted directly on the 3-D structure.
Structures shown: human, swordtail
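The red/blue coloring described above follows from a column-by-column comparison of the aligned sequences. The sketch below shows that logic on two short invented sequences; the sequences, the "gap" label, and the function name are assumptions for illustration, and Cn3D's own coloring code is not shown here.

```python
def color_columns(seq_a: str, seq_b: str) -> list:
    """Label each aligned column: red for identical residues, blue for a mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    colors = []
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            colors.append("gap")   # one sequence has no residue in this column
        elif a == b:
            colors.append("red")
        else:
            colors.append("blue")
    return colors


# Invented aligned fragments, for illustration only.
human_fragment = "MKV-LSAG"
swordtail_fragment = "MKIALSTG"
colors = color_columns(human_fragment, swordtail_fragment)
print(colors)
```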

Using Cn3D to model domains
Domains are functional parts of a protein. Here the same small functional part of a series of proteins has been lined up and the sequence is compared. Some regions match very closely and are in pink. Others are similar. The important serine is highlighted in yellow.

Other services and databases from the NCBI
LocusLink links to all available information from NCBI and beyond for a few well-characterized model organisms. LocusLink is a great starting point: it collects key information on each gene/protein from major databases. It now covers 8 organisms. RefSeq provides a curated, optimal accession number for each mRNA (NM_006744) or protein (NP_007635).
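The prefix of a RefSeq accession tells you what kind of molecule the record describes. The sketch below maps a few common prefixes to a description; the table is deliberately partial, and the function and dictionary names are hypothetical.

```python
# Partial table of RefSeq accession prefixes; other prefixes exist.
REFSEQ_PREFIXES = {
    "NM_": "curated mRNA",
    "NP_": "curated protein",
    "XM_": "model (predicted) mRNA",
    "XP_": "model (predicted) protein",
    "NC_": "complete genomic molecule",
}


def describe_accession(accession: str) -> str:
    """Describe a RefSeq accession by its three-character prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "prefix not in this table")


for acc in ("NM_006744", "NP_007635", "U12345"):
    print(acc, "->", describe_accession(acc))
```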

LocusLink Results
The results of a LocusLink search include:
Locus ID
Species
Locus symbol
Locus name
Locus location
Links: Protein Database, OMIM, Reference Sequence, Related GenBank Sequences, Homologene Data, UniGene, Variation Data
When you find a gene in LocusLink, you can easily compare it with other useful information. Not all genes have the same amount of information known.

LocusLink: Selected Higher Genomes
OMIM RefSeq GenBank dbSNP UniGene Full report PubMed HomoloGene Map Viewer Protein

Protein Database
The Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL, and DDBJ as well as protein sequences submitted to:
Protein Information Resource (PIR)
SWISS-PROT
Protein Research Foundation (PRF)
Protein Data Bank (PDB) (sequences from solved structures)

NCBI Protein Databases
GenPept: GenBank, EMBL, DDBJ CDS translations
RefSeq: mRNA based (NP_) and genome based (XP_)
Swiss-Prot: curated, high-quality protein reviews
PIR: Protein Information Resource, Georgetown University
PRF: Protein Research Foundation
PDB: Protein Data Bank, sequences from structures

Entrez Protein
GenPept (GB, EMBL, DDBJ) 3,442,298
RefSeq 856,191
Third Party Annotation ,834
Swiss-Prot ,508
PIR ,821
PRF ,079
Total ,442,298
BLAST nr ,642,191

Protein Link BLAST Link Conserved Domains

Related Proteins: Redundancy
Redundant Sequences
Since the Protein database pulls in sequences from all these other databases, some of the entries can be considered redundant or duplicates. You can do some searches with the “nonredundant” database. Sometimes, however, you might want the additional information that is found in some of the specialized databases. Swiss-Prot is an example: much more information has been added to its entries because they are more heavily curated.
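The idea behind a nonredundant set can be sketched as grouping records by exact sequence, so that each distinct sequence appears once while the identifiers from every source database are kept. The records below are invented, and the function name is hypothetical; this illustrates the concept and is not NCBI's actual procedure.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group (identifier, sequence) pairs so each exact sequence appears once."""
    groups = defaultdict(list)
    for identifier, sequence in records:
        groups[sequence.upper()].append(identifier)
    return dict(groups)


# Invented records from three source databases, for illustration only.
records = [
    ("genpept|X1", "MKVLSAG"),
    ("swissprot|Y1", "MKVLSAG"),
    ("pir|Z1", "MKVLTAG"),
]
nonredundant = collapse_identical(records)
for sequence, identifiers in nonredundant.items():
    print(sequence, identifiers)
```

Keeping every identifier with the merged sequence is what lets you go back to the more heavily curated source entry when you need its extra annotation.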

Related Proteins: Links
Sequence from MutL structure

BLink: non-redundant relatives
Arabidopsis homolog Conserved Domain

MLH1 Domain Structure: CDD
CDD is a database of protein domains. The source databases for CDD are Pfam, SMART, and COG. This link has more information on domains.
Domains shown: ATPase Domain, Mismatch Repair Domain

MLH1: ATPase Domain

1BGQ: ATPase Domain in Cn3D
Yeast HSP90
ATP binding site helix
You can highlight a region in the primary sequence and then identify it in the 3-D model. This is very handy!

Variations Human MLH1

Finding structural models
BLink

Mapping Variation Onto Structure
Loads sequence alignment and structure in Cn3D
Bacterial DNA mismatch repair proteins

Mapping Variation Onto Structure
Asn Ile Ile – Val Conserved Asn

NCBI Genome Databases
The Genome database provides views for a variety of genomes, complete chromosomes, sequence maps with contigs, and integrated genetic and physical maps.

Microbial Genomes ZWF

Genome Search Results
The Genome database includes full (and some partial) genomes from viruses to complex organisms

Genome Entry
Genome entries include:
Maps of the genome
Links to the sequence
The organism for the genome

Genes Database: All Genomes
Coming soon!


But wait! There’s more!
There is even more at NCBI than I have covered here. This site map is also a guide to NCBI resources. Each link leads to a brief description of the resource on this page, then to the resource itself.

There are many bioinformatics servers outside NCBI.
Try ExPASy’s sequence retrieval system (ExPASy = Expert Protein Analysis System), or try ENSEMBL for a premier human genome web browser.
