2 The Progress
[Figure: number of DNA base pairs (billions) in GenBank over the official “15 year” Human Genome Project. Milestones marked: first 2 bacterial genomes complete; first eukaryote complete (yeast); first metazoan complete (roundworm); 122+ bacterial genomes; 17 eukaryotic genomes complete or near completion, including Homo sapiens, mouse and fruit fly. Data from NCBI and TIGR.]
In the last 5 years or so the amount of data has grown exponentially, as have the online databases and resources. This slide shows the growth in the total number of base pairs and in the number of completed genomes since the beginning of the Human Genome Project. The sheer size and growth of this resource over the last 5 years has been impressive, and daunting.
Few scientists are aware of, or make full use of, all the open-source and public resources available to them through the internet. The annual Nucleic Acids Research Database issue listed 548 databases this year! And, as the quote below mentions, only half of those who use the databases are familiar with their tools. The Wellcome Trust study also made it clear that many people become users of a database after being told about it by colleagues.
“Despite the large amount of publicity surrounding the Human Genome Project, a recent survey conducted on behalf of the Wellcome Trust indicates that only half of biomedical researchers using genome databases are familiar with the tools that can be used to actually access the data.”
From “The Molecular Biology Database Collection: 2003 update” by Andreas D. Baxevanis, in the Jan 1, 2003 NAR database issue.
4 NCBI - GenBank
GenBank: all publicly available nucleotide and amino acid sequences.
Data sources:
- Direct submission from scientists.
- Literature.
- Genome sequencing.
DNA database divisions (examples):
- Organism division (Human, Bacteria, etc.).
- Molecule division (DNA, RNA, protein).
- Sequence division (Genome, ESTs, STSs).
5 Sequence databases
An optimal database should be comprehensive, well annotated, easy to search and to retrieve data from, and should provide cross-references.
The GenBank database: as of April 2004, there are over 38,989,342,565 bases in GenBank.
Problem 1: huge databases. Redundancy and inadequate sequences.
Problem 2: submission by users. Redundancy; only the submitter can change a record; records are not always up to date; annotation may be partial.
A database is a collection of data that is organized so that its content can easily be accessed, managed and updated.
6 GenBank
HELP!!!
Instructions:
- Nucleotide database: search human[Organism] AND dUTPase[Title] without limits.
- Add limits on ESTs. (EST: mRNA origin. STS: markers. TPA: third party. GSS: sequences are genomic in origin, unlike mRNA origin… Explain all limits!)
- Show how to do it in Preview/Index!
- Look at the complete CDS, and then at exon 3: redundancy.
- Analyze both the complete CDS and exon 3.
- Show FASTA format!!! + send to file, to text, to clipboard.
- Look at the protein database!
7 Unique Identifiers at NCBI
- Accession numbers apply to a complete sequence record.
- Sequence identification numbers apply to the individual sequences within a record:
  - GI number: assigned consecutively by NCBI to each sequence it processes.
  - Version number: the accession number followed by a dot and a version number.
The format of accession numbers varies, depending upon the source database:
- GenBank/EMBL/DDBJ: one letter followed by five digits, e.g. U12345, or two letters followed by six digits, e.g. AY123456.
- Swiss-Prot: all are six characters, [O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9], e.g. P12345 and Q9JJS7.
- RefSeq: two letters, an underscore, and six digits, with prefixes such as NM_ (mRNA), NT_ (contig), NC_ (chromosome), NG_ (genomic region).
If a sequence changes in any way, it receives a new GI number, and the version number is incremented by one.
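The accession formats listed on this slide can be written as regular expressions. The following is a minimal sketch in Python, not an NCBI tool: the names `PATTERNS`, `split_version` and `candidate_sources` are made up for illustration, and the patterns cover only the formats described here. Note that an identifier such as P12345 fits both the one-letter GenBank pattern and the Swiss-Prot pattern, so the function returns every source whose format matches.

```python
import re

# Patterns transcribed from the formats described on the slide (illustrative only).
PATTERNS = {
    "GenBank/EMBL/DDBJ": re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    "Swiss-Prot": re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]"),
    "RefSeq": re.compile(r"[A-Z]{2}_\d{6}"),
}


def split_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version); version is None if absent."""
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version.isdigit() else None


def candidate_sources(identifier):
    """Return the names of all source databases whose accession format matches."""
    accession, _ = split_version(identifier.strip().upper())
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(accession)]
```

For example, `candidate_sources("Q9JJS7")` gives only Swiss-Prot, while `split_version("AY123456.2")` separates the accession from its version number.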
8 Data Formats
Many data formats are used for sequences: FASTA format, GenBank format, EMBL format…
12 FASTA format
Example:
>my_sequence_name
BTYKLJGJFKHVHFMGHFKHGJFJFVKHGJHLNLNLJKJGKGKGKHLJH
- Easy to parse.
- Least informative.
- Default input format for sequence analysis software (e.g., BLAST, ClustalW).
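To show why FASTA counts as easy to parse, here is a minimal sketch of a reader in plain Python. The function name `parse_fasta` is an assumption made for this example; real projects would normally use an established library instead. A line starting with `>` opens a new record, and every following line is appended to that record's sequence.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines may be wrapped; join them back together.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Calling `list(parse_fasta(text))` on the example above would return a single pair: the name `my_sequence_name` and its sequence string.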
13 Swiss-Prot and TrEMBL (http://www.ebi.ac.uk/swissprot/)
Swiss-Prot: a curated protein sequence database.
- Core data: sequence, taxonomy and bibliographic reference.
- Annotation data: function, domain structure, post-translational modifications, protein variants, etc.
- Provides a high level of annotation.
- Minimal level of redundancy.
- High level of integration with other databases (cross-references).
TrEMBL: a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot.
14 ExPASy Proteomics Server http://www.expasy.org/
17 More resources
Some good places for refreshing your biochemistry (address: description):
- The glycan structure database.
- lipid.bio.m.u-tokyo.ac.jp: The ultimate lipid database.
- chem.sis.nlm.nih.gov/chemidplus/: ChemIDplus, identifying molecules by drawing them up!
The main resources for biochemical pathways and enzymes (address: description):
- Find which metabolic pathway a molecule belongs to.
- The famous Kyoto Encyclopedia of Genes and Genomes (KEGG). E.C. (Enzyme Commission) numbers or gene names are the best starting points for this resource.
- brenda.bc.uni-koeln.de: The comprehensive enzyme information system BRENDA.
- The official site for enzyme nomenclature of the International Union of Biochemistry and Molecular Biology (IUBMB).
- The Encyclopedia of E. coli Genes and Metabolism. It is progressively extending to other bacteria.
18 Search sequence databases
Two search methods:
- Text-based searching: searches textual information contained in the header sections of database entries.
- Sequence search: searches sequence information with sequence queries (next week!).
19 Text based searching
- Search for query words in specific fields. Choose your database and add limits.
- Examples: Entrez, SRS.
20 NCBI – Entrez (http://www.ncbi.nih.gov/Entrez/)
Entrez is the search tool for NCBI databases.
- The search starts by choosing the relevant group of databases (Nucleotide, Protein, etc.).
- Use field qualifiers, logical operators, and a “limits” form.
Boolean operators: AND, OR, NOT. Group terms together by using ().
Examples:
- cytochrome AND human
- cytochrome AND (human OR mouse)
Always use upper case for operators. If you don’t use any operator, the query words are searched together!
Field qualifiers: search in a specific field (author, organism, journal …).
- homo sapiens [organism] AND kinase AND nature [journal]
Narrowing a search step by step:
- Cytochrome b
- Cytochrome b AND human
- Cytochrome b AND human[organism]
- Cytochrome b AND human[organism], plus limits.
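The query rules on this slide (upper-case operators, parentheses for grouping, field qualifiers in square brackets) can be sketched as a few string helpers. This is only an illustration in Python: `field`, `any_of` and `all_of` are hypothetical names, not part of Entrez or any NCBI library, and the code only builds the query text that you would then paste into the Entrez search box.

```python
def field(term, qualifier=None):
    """Attach a field qualifier, e.g. field('human', 'organism') -> 'human[organism]'."""
    return f"{term}[{qualifier}]" if qualifier else term


def any_of(*terms):
    """Join terms with OR and wrap them in parentheses so they stay grouped."""
    if not terms:
        raise ValueError("at least one term is required")
    return terms[0] if len(terms) == 1 else "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    """Join terms with AND (operators are always written in upper case)."""
    if not terms:
        raise ValueError("at least one term is required")
    return " AND ".join(terms)


# Rebuilding two of the example queries from the slide.
grouped = all_of("cytochrome", any_of("human", "mouse"))
qualified = all_of(field("homo sapiens", "organism"), "kinase", field("nature", "journal"))
```

Here `grouped` is the text `cytochrome AND (human OR mouse)`, and `qualified` combines an organism qualifier, a free-text word and a journal qualifier in one query.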
21 Entrez Protein Database http://www.ncbi.nlm.nih.gov/entrez/query
Includes SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq.
22 Entrez Nucleotides database http://www.ncbi.nlm.nih
Includes GenBank, RefSeq, and PDB. As of April 2004, there are over 38,989,342,565 bases.
24 Gene-centric Databases
Repository-type database:
- Many pieces of sequence related to a sequence.
- Examples: GenBank, Swiss-Prot.
Gene-centric database:
- All the sequence information relevant to a given gene is made accessible at once: get the whole story at once!
- Provides easy access when the query is related to a gene or function.
- Examples: Gene, UniGene, RefSeq.
25 Gene http://www.ncbi.nih.gov/entrez/query.fcgi?db=gene
- Gene provides a unified query environment for genes.
- Query on names, symbols, accessions, publications, GO terms, chromosome numbers, E.C. numbers, and many other attributes associated with genes and the products they encode.
- Unique identifiers are assigned to genes with known map positions.
- Supplies key connections between map, sequence, expression, structure, function, citation, and homology data.
- Provides identifiers to UniGene, RefSeq, relevant GenBank entries, OMIM and SNPs.
- Can be considered the successor to LocusLink.
26 RefSeq http://www.ncbi.nlm.nih.gov/projects/RefSeq/
- Non-redundancy.
- Distinct accession series.
- Updates to reflect current knowledge of sequence data and biology.
- Ongoing curation by NCBI staff and collaborators, with reviewed records indicated.
- Data validation and format consistency.
27 ESTs division
Uses:
- Gene prediction.
- Expression level (only clues).
- Alternative splicing.
Problems:
- Redundant database.
- Mistakes (single read-through).
- Incomplete coverage of genes:
  - Only for model eukaryotic organisms.
  - Rare tissues.
  - Low copy number of genes.
28 UniGene http://www.ncbi.nlm.nih.gov/UniGene
- An automatic partitioning of GenBank sequences into a non-redundant set of gene-oriented clusters.
- Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.
- Focus on mRNA and EST information.
30 Wouldn’t it be great if…
[Diagram: a genome backbone (sequence, base position number) with annotation tracks aligned to it: chromosome band, known genes, predicted genes, evolutionary conservation, SNPs, STS sites, gap locations, repeated regions, microarray/expression data, and more… Each track links out to more data.]
A great deal of information has come to us from the formal, organized Human Genome Project. But other data has come from individual laboratories doing traditional benchwork; some has come from the literature; and some has come from new large-scale technologies that have arisen in the last few years, such as microarrays for gene expression detection.
So there are tremendous amounts of data around, and lots of places to try to find them. The UCSC Genome Browser is great because it organizes a lot of this material in one place. It uses the backbone of the genome (the official backbone sequence of the Human Genome Project is nicknamed the Golden Path) and combines this Golden Path information with all kinds of other useful and important biological information, such as chromosome banding patterns, known genes, predictions, expression data, comparative genomics, SNPs, and so on. All of this data is lined up in one place so you can quickly find new information about your regions of interest. Better still, all the data links out to other databases, web sites and literature, so you can go as deep as you want into any topic that you care about.
As the diagram shows, the data is organized along the genomic sequence backbone. All of the other information that is available is referred to as “Annotation Tracks”. Later we’ll see that you can get to these regions of interest, and then link out to other great collections of data as well.
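The idea of annotation tracks lined up on one sequence backbone can be pictured as a simple data structure: every feature is an interval on the backbone, labelled with the track it belongs to, and looking at a region means collecting the features that overlap it. The sketch below is an assumption-laden toy in Python, not how the UCSC Genome Browser is implemented; the `Feature` type, the `features_in_region` function, and all names and coordinates in `toy_features` are invented for illustration, and a single chromosome is assumed.

```python
from collections import namedtuple

# One annotated interval on the backbone; coordinates are half-open [start, end).
Feature = namedtuple("Feature", ["track", "name", "start", "end"])


def features_in_region(features, start, end):
    """Group the names of features overlapping [start, end) by annotation track."""
    by_track = {}
    for feature in features:
        # Two half-open intervals overlap when each starts before the other ends.
        if feature.start < end and feature.end > start:
            by_track.setdefault(feature.track, []).append(feature.name)
    return by_track


# Invented toy data, only to show the shape of the lookup.
toy_features = [
    Feature("known genes", "geneA", 100, 500),
    Feature("known genes", "geneB", 900, 1200),
    Feature("SNPs", "snp1", 250, 251),
    Feature("repeated regions", "repeat1", 480, 620),
]
```

Calling `features_in_region(toy_features, 200, 600)` collects geneA, snp1 and repeat1 under their respective tracks and leaves out geneB, which lies outside the region.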
31 Solution: Genome Browsers, or “Map Viewers”
Introduce self. Introduce the section.
Our goal here is to cover the basics of getting the genome browser software to work for you; we want to introduce you to searching the UCSC Genome Browser via text or sequences to get the information that you want.
The materials used in this slide presentation were developed by Warren Lathe and Mary Mangan, from OpenHelix, LLC, under contract from the UCSC Genome Bioinformatics group.
36 UCSC Home page (genome.ucsc.edu)
[Screenshot callouts: navigation; general information; specific information (new features, current status, etc.)]
Okay, so let’s move on to what the web site actually looks like, and how to get your searches accomplished!
When you first arrive at the UCSC genome bioinformatics site, this is what you will see. First, there is a section that contains general information about the site. Second, there is specific information about NEWS: new features, changes, and the current state of the data that is available. This information is worth a quick check when you visit the site, in case there have been changes to the data since the last time you visited.
But the real substance of the site, the tools, is accessible in a couple of ways from this page. There are navigation bars at the top and the side which give you access to all of the really cool stuff that is going on here. This page gives you access to the several types of tools that are available from this site:
- Tools = Browser, Blat, Tables, Downloads, FAQ, User Guide. Access from either navigation option, top or side.
- Mirrors = other locations, in case one isn’t available to you, or one might have faster access from your location.
- Archives = older data; sometimes you might need to troll through older data to re-examine something you found before.
- Credits = these are the people who bring you this browser; very important.
- Cite Us = please cite the resource properly in your talks and papers; this helps them get grant $$!!
- Jobs = anyone looking?
- Links = a great collection of links to other tools that might be of use or interest to you.
- Contact us = mailing list, error reports, mirroring information.
To actually get in and start searching the database, there are several options. You can search by text: gene name, gene symbol, keywords, ID, etc. To do this we will use the Genome Browser link. You can also search by sequence, if you have a specific sequence of interest, using the BLAT search tool, but we will start with text searching from the Genome Browser gateway, the link at the top for Genome Browser. The link that says Genomes will get you to the same location. For our purposes, we will click the link that says Genome Browser.