Biological databases.

Presentation on theme: "Biological databases."— Presentation transcript:

1 Biological databases

2 The Progress
[Figure: number of DNA base pairs (billions) in GenBank over the official “15 year” Human Genome Project, annotated with genome milestones: first 2 bacterial genomes complete; first eukaryote complete (yeast); first metazoan complete (roundworm); 122+ bacterial genomes; 17 eukaryotic genomes complete or near completion, including Homo sapiens, mouse and fruit fly. Data from NCBI and TIGR.]
In the last 5 years or so, the amount of data has grown exponentially, and so have the online databases and resources. This slide shows the growth of the total number of base pairs and the total number of genomes completed since the beginning of the Human Genome Project. As you can see, the sheer size and growth of this resource over the last 5 years has been impressive, and daunting. Few scientists are aware of, or make full use of, all the open-source and public resources available to them through the internet. The annual Nucleic Acids Research database issue listed 548 databases this year! And, as the quote below mentions, only half of those who use the databases are familiar with their tools. The Wellcome Trust study also made it clear that many people become users of a database after being told about it by colleagues.
“Despite the large amount of publicity surrounding the Human Genome Project, a recent survey conducted on behalf of the Wellcome Trust indicates that only half of biomedical researchers using genome databases are familiar with the tools that can be used to actually access the data.” From “The Molecular Biology Database Collection: 2003 update” by Andreas D. Baxevanis, in the Jan 1, 2003 NAR database issue.

3 International Nucleotide Sequence Database Collaboration
Members: GenBank (NCBI), EMBL (European Molecular Biology Laboratory), DDBJ (Japan).
Also shown on the slide: PubMed, Nucleotides, Proteins, Genomes, Taxonomy, Structure, Domains.

4 NCBI - GenBank
GenBank: all publicly available nucleotide and amino acid sequences.
Data sources: direct submission from scientists, the literature, genome sequencing.
DNA database divisions (examples):
- Organism division (human, bacteria, etc.).
- Molecule division (DNA, RNA, protein).
- Sequence division (genome, ESTs, STSs).

5 Sequence databases
A database is a collection of data that is organized so that its content can easily be accessed, managed and updated.
An optimal database should be comprehensive, well annotated, easy to search, easy to retrieve data from, and should provide cross-references.
The GenBank database: as of April 2004, there are over 38,989,342,565 bases in GenBank.
Problem 1: huge databases lead to redundancy and inadequate sequences.
Problem 2: submission by users leads to redundancy; only the submitter can change a record; records are not always up to date; annotation is partial.

6 GenBank HELP!!!
Instructions:
- In the Nucleotide database, search human[Organism] AND dUTPase[Title] without limits.
- Add limits on ESTs. Explain all limits! (EST: sequences of mRNA origin. STS: markers. TPA: third-party annotation. GSS: sequences of genomic origin, unlike ESTs.)
- Show how to do it in Preview/Index!
- Look at the complete CDS, and then at exon 3: redundancy. Analyze both the complete CDS and exon 3.
- Show FASTA format! Also send to file, to text, to clipboard.
- Look at the Protein database!
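The search in the first instruction can also be written as a URL. This is a minimal sketch, assuming you want to run the same query through NCBI's E-utilities `esearch` service instead of the web form the slide demonstrates. The helper name `esearch_url` and the `retmax` value are illustrative choices, not part of the original exercise, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base address of the E-utilities esearch service. Using it here is an
# assumption; the slide itself works through the Entrez web form.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an esearch URL for an Entrez query string (hypothetical helper)."""
    # urlencode percent-escapes the square brackets and encodes spaces.
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


url = esearch_url("nucleotide", "human[Organism] AND dUTPase[Title]")

# Decoding the query string recovers the original search term.
decoded = parse_qs(urlsplit(url).query)
print(decoded["term"][0])
```

Keeping the query as a plain string means the same text can be pasted into the Entrez search box to compare results.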

7 Unique Identifiers at NCBI
Accession numbers apply to a complete sequence record.
Sequence identification numbers apply to the individual sequences within a record:
- GI number: assigned consecutively by NCBI to each sequence it processes.
- Version number: the accession number followed by a dot and a version number.
The format of accession numbers varies, depending upon the source database:
- GenBank/EMBL/DDBJ: one letter followed by five digits, e.g. U12345, or two letters followed by six digits, e.g. AY123456.
- Swiss-Prot: all are six characters, [O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9], e.g. P12345 and Q9JJS7.
- RefSeq: two letters, an underscore, and six digits, with prefixes such as NM_ (mRNA), NT_ (contig), NC_ (chromosome), NG_ (genomic region).
If a sequence changes in any way, it receives a new GI number, and the version number is incremented by one.
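The accession formats listed on this slide translate directly into regular expressions. The sketch below is only as complete as the slide: the patterns are transcribed from its descriptions, the function names are made up for illustration, and real accession formats include more variants than these. Note that the slide's own example P12345 fits both the GenBank and the Swiss-Prot pattern, so the function returns every source that matches instead of a single answer.

```python
import re

# Patterns transcribed from the slide's descriptions (an incomplete picture
# of real-world accession formats).
ACCESSION_PATTERNS = {
    "GenBank/EMBL/DDBJ": re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    "Swiss-Prot": re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]"),
    "RefSeq": re.compile(r"[A-Z]{2}_\d{6}"),
}


def split_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version); version may be None."""
    accession, dot, version = identifier.partition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return identifier, None


def candidate_sources(identifier):
    """Return the names of all source databases whose pattern fits the accession."""
    accession, _ = split_version(identifier)
    return [
        name
        for name, pattern in ACCESSION_PATTERNS.items()
        if pattern.fullmatch(accession)
    ]


for example in ("U12345", "AY123456.1", "Q9JJS7", "P12345"):
    print(example, candidate_sources(example))
```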

8 Data Formats
Many data formats are used for sequences: FASTA format, GenBank format, EMBL format…

9 GenBank format See

10 GenBank format

11 GenBank format

12 FASTA format
Example:
    >my_sequence_name
    BTYKLJGJFKHVHFMGHF
    KHGJFJFVKHGJHLNLNLJ
    KJGKGKGKHLJH
Easy to parse.
Least informative.
Default input format for sequence analysis software (e.g., BLAST, ClustalW).
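To show what “easy to parse” means in practice, here is a minimal FASTA reader. It is a sketch under simple assumptions: a header line starts with `>`, every following line up to the next header belongs to that sequence, and blank lines are ignored. The function name `parse_fasta` is an illustrative choice; established libraries handle many more edge cases.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without separators
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The example from the slide, written one line per row.
sample = """>my_sequence_name
BTYKLJGJFKHVHFMGHF
KHGJFJFVKHGJHLNLNLJ
KJGKGKGKHLJH
"""

for name, sequence in parse_fasta(sample):
    print(name, len(sequence))
```

The header line carries the only annotation, which is why the slide calls the format the least informative.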

13 Swiss-Prot and TrEMBL
Swiss-Prot: a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy, and a high level of integration with other databases (cross-references).
Core data: sequence, taxonomy and bibliographic reference.
Annotation data: function, domain structure, post-translational modifications, protein variants, etc.
TrEMBL: a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot.

14 ExPASy Proteomics Server

15 Swiss-Prot file format

16 Flat-file original Swiss-Prot format

17 More resources
Some good places for refreshing your biochemistry:
- The glycan structure database.
- The ultimate lipid database.
- ChemIDplus: identifying molecules by drawing them!
The main resources for biochemical pathways and enzymes:
- Find which metabolic pathway a molecule belongs to.
- The famous Kyoto Encyclopedia of Genes and Genomes (KEGG). E.C. (Enzyme Commission) numbers or gene names are the best starting points for this resource.
- The comprehensive enzyme information system BRENDA.
- The official site for enzyme nomenclature of the International Union of Biochemistry and Molecular Biology (IUBMB).
- The Encyclopedia of E. coli Genes and Metabolism. It is progressively extending to other bacteria.

18 Search sequence databases
Two search methods:
- Text-based searching: searches textual information contained in the header sections of database entries.
- Sequence search: searches sequence information with sequence queries (next week!).

19 Text-based searching
Search for query words in specific fields. Choose your database and add limits. Examples: Entrez, SRS.

20 NCBI – Entrez
Entrez is the search tool for NCBI databases. The search starts by choosing the relevant group of databases (Nucleotide, Protein, etc.). Use field qualifiers, logical operators, and a “limits” form.
Boolean operators: AND, OR, NOT. Group terms together by using (). Always use upper case for operators. If you don’t use any operator, the query words are searched together!
Examples:
    cytochrome AND human
    cytochrome AND (human OR mouse)
Field qualifiers search in a specific field: author, organism, journal…
Example:
    homo sapiens [organism] AND kinase AND nature [journal]
Narrowing a search step by step:
    Cytochrome b
    Cytochrome b AND human
    Cytochrome b AND human[organism]
    Cytochrome b AND human[organism], plus limits
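The rules on this slide (upper-case operators, parentheses for grouping, field qualifiers in square brackets) are easy to get wrong when typing long queries by hand. The following sketch composes query strings from small pieces; the helper names `field`, `all_of` and `any_of` are hypothetical and are not part of any NCBI tool.

```python
def field(term, qualifier=None):
    """Attach a field qualifier, e.g. field('human', 'organism') -> 'human[organism]'."""
    return f"{term}[{qualifier}]" if qualifier else term


def all_of(*parts):
    """Join parts with the upper-case AND operator."""
    return " AND ".join(parts)


def any_of(*parts):
    """Join parts with the upper-case OR operator, grouped in parentheses."""
    return "(" + " OR ".join(parts) + ")"


# Two of the slide's examples, rebuilt from the helpers.
grouped = all_of("cytochrome", any_of("human", "mouse"))
qualified = all_of(
    field("homo sapiens", "organism"),
    "kinase",
    field("nature", "journal"),
)
print(grouped)
print(qualified)
```

Wrapping every OR group in parentheses keeps the grouping explicit once the group is combined with AND.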

21 Entrez Protein Database http://www.ncbi.nlm.nih.gov/entrez/query
Includes SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq.

22 Entrez Nucleotides database http://www.ncbi.nlm.nih
Includes GenBank, RefSeq, and PDB. As of April 2004, there are over 38,989,342,565 bases.

23 SRS
Choose a library, fill in the query form, get results.

24 Gene-centric Databases
Repository-type database:
- Many pieces of sequence related to a sequence.
- Examples: GenBank/SwissProt.
Gene-centric database:
- All the sequence information relevant to a given gene is made accessible at once: get the whole story at once!
- Provides easy access when the query is related to a gene or function.
- Examples: Gene, UniGene, RefSeq.

25 Gene
Gene provides a unified query environment for genes.
- Query on names, symbols, accessions, publications, GO terms, chromosome numbers, E.C. numbers, and many other attributes associated with genes and the products they encode.
- Unique identifiers are assigned to genes with known map positions.
- Supplies key connections of map, sequence, expression, structure, function, citation, and homology data.
- Provides identifiers to UniGene, RefSeq, relevant GenBank entries, OMIM and SNPs.
- Can be considered the successor to LocusLink.

26 RefSeq
- Non-redundancy.
- Distinct accession series.
- Updates to reflect current knowledge of sequence data and biology.
- Ongoing curation by NCBI staff and collaborators, with reviewed records indicated.
- Data validation and format consistency.

27 ESTs division
Uses:
- Gene prediction.
- Expression level (only clues).
- Alternative splicing.
Problems:
- Redundant database.
- Mistakes (single read-through).
- Incomplete coverage of genes: only for model eukaryotic organisms; rare tissues; low copy number of genes.

28 UniGene
An automatic partitioning of GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and its map location. The focus is on mRNA and EST information.

29 Organism specific databases

30 Wouldn’t it be great if…
[Figure: the genome backbone sequence with annotation tracks lined up along it (base position number, chromosome band, known genes, predicted genes, evolutionary conservation, SNPs, STS sites, gap locations, repeated regions, microarray/expression data, more…) and links out to more data.]
A great deal of information has come to us from the formal, organized Human Genome Project. But other data has come from individual laboratories doing traditional benchwork; some has come from the literature; and some has come from new large-scale technologies that have arisen in the last few years, such as microarrays for detecting gene expression. So there are tremendous amounts of data around, and lots of places to try to find them.
The UCSC Genome Browser is great because it organizes a lot of this material in one place. It uses the backbone of the genome (the official backbone sequence of the Human Genome Project is nicknamed the Golden Path) and combines this Golden Path information with all kinds of other useful and important biological information, such as chromosome banding patterns, known genes, predictions, expression data, comparative genomics, SNPs, and so on. All of this data is lined up in one place so you can quickly find new information about your regions of interest. Better still, all the data links out to other databases, web sites and literature, so you can go as deep as you want into any topic that you care about.
As the diagram shows, the data is organized along the genomic sequence backbone. All of the other information that is available is referred to as “Annotation Tracks”. Later we’ll see that you can get to these regions of interest, and then link out to other great collections of data as well.
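The idea of annotation tracks lined up along one backbone can be pictured as a simple data structure: each track is a list of features with start and end positions on the same coordinate system, and looking at a region means collecting whatever overlaps it. The sketch below illustrates that idea only; it is not how the UCSC Genome Browser is implemented, and the track names, coordinates and labels are invented placeholders.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    start: int  # first backbone position covered (inclusive)
    end: int    # last backbone position covered (inclusive)
    label: str


def features_in_region(tracks, start, end):
    """Return {track name: [features overlapping start..end]} for non-empty tracks."""
    hits = {}
    for name, features in tracks.items():
        overlapping = [f for f in features if f.start <= end and f.end >= start]
        if overlapping:
            hits[name] = overlapping
    return hits


# Invented example data: three tracks sharing one backbone coordinate system.
tracks = {
    "known genes": [Feature(100, 200, "geneA"), Feature(400, 500, "geneB")],
    "SNPs": [Feature(180, 180, "snp1"), Feature(450, 450, "snp2")],
    "repeats": [Feature(300, 350, "rep1")],
}

hits = features_in_region(tracks, 150, 250)
for track, features in hits.items():
    print(track, [f.label for f in features])
```

Because every track uses the same coordinates, adding a new kind of annotation only means adding another list.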

31 Solution: Genome Browsers, or “Map Viewers”
Introduce self. Introduce the section. Our goal here is to cover the basics of getting the genome browser software to work for you; we want to introduce you to searching the UCSC Genome Browser via text or sequences to get the information that you want. The materials used in this slide presentation were developed by Warren Lathe and Mary Mangan, from OpenHelix, LLC, under contract from the UCSC Genome Bioinformatics group.

32 NCBI Map Viewer


34 Ensembl
Ensembl example:


36 UCSC Home page
[Figure labels: navigate; General information; Specific information (new features, current status, etc.).]
Okay, so let’s move on to what the web site actually looks like, and how to get your searches accomplished! When you first arrive at the UCSC Genome Bioinformatics site, this is what you will see. First, there is a section that contains general information about the site. Second, there is specific information under NEWS: new features, changes, the current state of the data that is available. This information is worth a quick check when you visit the site, in case there have been changes to the data since the last time you visited. But the real substance of the site, the tools, is accessible in a couple of ways from this page. There are navigation bars at the top and the side which give you access to the several types of tools that are available from this site:
- Tools = Browser, Blat, Tables, Downloads, FAQ, User Guide. Access from either navigation option, top or side.
- Mirrors = other locations, just in case one isn’t available to you, or one might have faster access from your location.
- Archives = older data; sometimes you might need to troll through older data to re-examine something you found before.
- Credits = these are the people who bring you this browser, very important.
- Cite Us = please cite the resource properly in your talks and papers; this helps them get grant $$!!
- Jobs = anyone looking?
- Links = a great collection of links to other tools that might be of use/interest to you.
- Contact us = mailing list, error reports, mirroring information.
To actually get in and start searching the database, there are several options. You can search by text: gene name, gene symbol, keywords, ID, etc. To do this we will use the Genome Browser link. You can also search by sequence, if you have a specific sequence of interest, using the BLAT search tool. We will start with text searching from the Genome Browser gateway, the link at the top for Genome Browser. The link that says Genomes will get you to the same location, but for our purposes we will click the link that says Genome Browser.
UCSC Material developed by W.C. Lathe and M. Mangan,
