NCBI FieldGuide NCBI Molecular Biology Resources July 8, 2004 University of São Paulo, Brazil “ Third Latin American Course on Bioinformatics for Tropical.


NCBI FieldGuide NCBI Molecular Biology Resources July 8, 2004 University of São Paulo, Brazil “Third Latin American Course on Bioinformatics for Tropical Disease Research” Chuong Huynh

NCBI FieldGuide About NCBI and NCBI Databases Entrez Databases and Text Searching Genome Resources NCBI Resources

NCBI FieldGuide The National Center for Biotechnology Information
Created in 1988 as a part of the National Library of Medicine at NIH, Bethesda, MD
–Establish public databases
–Research in computational biology
–Develop software tools for sequence analysis
–Disseminate biomedical information

NCBI FieldGuide Types of Databases
Primary Databases
–Original submissions by experimentalists
–Content controlled by the submitter
Examples: GenBank, SNP, GEO
Derivative Databases
–Built from primary data
–Content controlled by third party (NCBI)
Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

NCBI FieldGuide The Entrez System

NCBI FieldGuide Entrez Nucleotides (June 18, 2004)
Primary: GenBank / EMBL / DDBJ 39,816,263
Derivative: RefSeq 299,238
Derivative: Third Party Annotation 4,265
Derivative: PDB 5,062
Other: Patent 2,152,071
Total 42,276,899

NCBI FieldGuide Entrez Protein
GenPept (GB, EMBL, DDBJ) 3,338,062
RefSeq 1,008,645
Third Party Annotation 4,685
Swiss-Prot 154,397
PIR 282,821
PRF 12,079
PDB 53,360
Patents 308,551
Total 4,797,015
BLAST nr 1,857,896

NCBI FieldGuide What is GenBank? NCBI’s Primary Sequence Database
Nucleotide only sequence database
Archival in nature
GenBank Data
–Direct submissions (traditional records)
–Batch submissions (EST, GSS, STS)
–ftp accounts (genome data)
Three collaborating databases
–GenBank
–DNA Database of Japan (DDBJ)
–European Molecular Biology Laboratory (EMBL) Database

NCBI FieldGuide International Sequence Database Collaboration
[Diagram: GenBank (NCBI, NIH; accessed through Entrez), EMBL (EBI; accessed through SRS) and DDBJ (CIB, NIG; accessed through getentry). Each database receives submissions and updates.]

NCBI FieldGuide GenBank: NCBI’s Primary Sequence Database ftp://ftp.ncbi.nih.gov/genbank/
Release 141 April ,676,218 Records
38,989,342,565 Nucleotides
>140,000 Species
146 Gigabytes
606 files
–full release every two months
–incremental and cumulative updates daily
–available only through internet

NCBI FieldGuide The Growth of GenBank
[Chart: sequence records (millions) and total base pairs (billions) by year, ’83 to ’04]

NCBI FieldGuide Organization of GenBank: GenBank Divisions
Records are divided into 17 Divisions: 11 Traditional, 6 Bulk
Traditional Divisions: Direct Submissions (Sequin and BankIt); Accurate; Well characterized
PRI (27) Primate
PLN (11) Plant and Fungal
BCT (8) Bacterial and Archaeal
INV (6) Invertebrate
ROD (12) Rodent
VRL (4) Viral
VRT (6) Other Vertebrate
MAM (1) Mammalian (ex. ROD and PRI)
PHG (1) Phage
SYN (1) Synthetic (cloning vectors)
UNA (1) Unannotated
Entrez query: gbdiv_xxx[Properties]

NCBI FieldGuide Organization of GenBank: GenBank Divisions
Records are divided into 17 Divisions: 11 Traditional, 6 Bulk
Bulk Divisions: Batch Submission ( and FTP); Inaccurate; Poorly characterized
EST (305) Expressed Sequence Tag
GSS (104) Genome Survey Sequence
HTG (62) High Throughput Genomic
STS (4) Sequence Tagged Site
HTC (4) High Throughput cDNA
PAT (15) Patent
Entrez query: gbdiv_xxx[Properties]
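
The division filter on these two slides, gbdiv_xxx[Properties], takes a three-letter division code in place of xxx (a later slide uses gbdiv_est[Properties]). A minimal Python sketch of building such a term follows; the helper name division_filter is hypothetical, and the code lists are simply copied from the slides.

```python
# Division codes as listed on the slides (11 traditional, 6 bulk).
TRADITIONAL = ["PRI", "PLN", "BCT", "INV", "ROD", "VRL", "VRT",
               "MAM", "PHG", "SYN", "UNA"]
BULK = ["EST", "GSS", "HTG", "STS", "HTC", "PAT"]


def division_filter(code):
    """Return the Entrez term for one GenBank division (hypothetical helper)."""
    code = code.upper()
    if code not in TRADITIONAL + BULK:
        raise ValueError(f"unknown GenBank division: {code}")
    # The slides write the division code in lower case inside the term.
    return f"gbdiv_{code.lower()}[Properties]"


print(division_filter("EST"))
```

The resulting term can be typed into an Entrez Nucleotide search box on its own or combined with other terms.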

NCBI FieldGuide A Traditional GenBank Record
LOCUS       AF bp mRNA linear INV 23-OCT-2002
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF
VERSION     AF GI:
KEYWORDS    .
SOURCE      Limulus polyphemus (Atlantic horseshoe crab)
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata; Xiphosura; Limulidae; Limulus.
REFERENCE   1 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated phosphoprotein
  JOURNAL   J. Neurosci. 18 (12), (1998)
  MEDLINE
  PUBMED
REFERENCE   2 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:

GenBank Record: Locus
LOCUS AF bp mRNA linear INV 23-OCT-2002
Callouts: Locus name, Length, Molecule type, Division, Modification Date

GenBank Record: Identifiers
ACCESSION AF
VERSION AF GI:

GenBank Record: Organism
SOURCE Limulus polyphemus (Atlantic horseshoe crab)
ORGANISM Limulus polyphemus
Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata; Xiphosura; Limulidae; Limulus.
Callout: NCBI’s Taxonomy

GenBank Record: Feature Table
FEATURES             Location/Qualifiers
     source
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS
                     /note="N-terminal protein kinase domain; C-terminal myosin heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC "
                     /db_xref="GI: "
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT     1201 a 689 c 782 g 1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//
Callout: GenPept IDs /protein_id="AAC " /db_xref="GI: "
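
The BASE COUNT line can be checked against the record length: 1201 + 689 + 782 + 1136 = 3808, which matches "bases 1 to 3808" in the REFERENCE lines. The short Python sketch below repeats that check and counts the bases in the first ORIGIN line; the function name count_bases is an illustrative choice, not part of any NCBI tool.

```python
from collections import Counter

# Counts copied from the BASE COUNT line of the record.
base_count = {"a": 1201, "c": 689, "g": 782, "t": 1136}
total = sum(base_count.values())
gc_fraction = (base_count["g"] + base_count["c"]) / total


def count_bases(origin_line):
    """Count a/c/g/t in one ORIGIN line, skipping the position number and spaces."""
    return Counter(ch for ch in origin_line.lower() if ch in "acgt")


first_line = "1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt"
line_total = sum(count_bases(first_line).values())

print(total, round(gc_fraction, 3), line_total)
```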

NCBI FieldGuide GenPept: FASTA format
>gi| |gb|AAC | myosin III [Limulus polyphemus]
MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQANKKVALKIIGHIAENLLDIETEYRIY
KAVNGIQFFPEFRGAFFKRGERESDNEVWLGIEFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAV
QYLHENSIIHRDIRAANIMFSKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNY
TCDVWSIGITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYRPCIQ
EIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQPHEKIYVDDLAFLDSP
TEEVVLENLEQRYRKGEIYTFAGDVLLTLNPGKVLPLYGDQTAVKYCERGRSDNPPHVFAVADRAYQQML
HHKSPQAVILSGVSGSGKSFCTHQVIRHLAFLGAQNKEGMREKLEYLCPLLDTLGNAYTSTNPNSSHFVK
ILEVTFTKTGKITGAILFTFLLEARRLTDIPKGERNFHVFYYFYEGLRSEGRLKEFGLEEKNYRYLPELK
SSNSPEYVKGYQQFLRALTSLAFTEEEIFAIQKVLAAILLLGETEIQNSAAFKLLGAESSELENTLTQDV
NARDVYARAMYLRLFSWIVAVVNRQLSFSRLVFGDVYSVTVIDSPGFENGLHNSLHQLCANVISDNLQNY
IQQIIFFKELEEYGEEGVNVPFNLEGGVDHRTLVNKLMDSGQGLLTAISKATQYQRKGESGWMESLQEAD
SEELVEFSNVNGKPIVSVKHIFRKVSYDATDLVKKNVEDKTRALTSTMQRSCDPRIRAIFSSENPSPFLS
SPRRSSIQENMLLPERTVTDSLHSALSSVLNLASTEDPPHLILCMRPQKKELINDYDSKSVQIQLHALNV
LETILIRQFGFARRISFVDFLNRYQYLAFDFNENVELTKENCRLLLLRLKMDGWTLGKNKVFLKYYSEEY
LSRIYETHIKKIVKVQAIARKYFVKVRQSKTKPH
>gi| |gb|AAA | metC peptide [Escherichia coli
MADKKLDTQLVNAGRSKKYSLGAVNSVIQRASSLVFDSVEAKKHATRNRANGELFYGRRGTLTHFSLQQA
MCELEGGAGCVLFPCGAAAVANSILAFIEQGDPRVPSSNS
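
In FASTA format each record is a header line starting with ">" followed by one or more lines of sequence. As a rough sketch of how such text might be read in Python, the generator below yields (header, sequence) pairs; the function name parse_fasta and the two short example records are made up for illustration and are not taken from GenPept.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical input: identifiers and sequences are invented.
example = """>seq1 example protein [Example organism]
MEYKCISEHL
PFETLPDPGD
>seq2 another example
MADKKLDTQL
"""

records = list(parse_fasta(example))
for header, seq in records:
    print(header, len(seq))
```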

NCBI FieldGuide Abstract Syntax Notation: ASN.1
Seq-entry ::= set {
  class nuc-prot,
  descr {
    title "Limulus polyphemus myosin III mRNA, complete cds.",
    source {
      org {
        taxname "Limulus polyphemus",
        common "Atlantic horseshoe crab",
        db {
          {
            db "taxon",
            tag id 6850 } },
        orgname {
          name binomial {
            genus "Limulus",
            species "polyphemus" },
          lineage "Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata; Xiphosura; Limulidae; Limulus",
          gcode 1,
          mgcode 5,
          div "INV" } },
Diagram labels: ASN.1, GenBank, GenPept, FASTA Nucleotide, FASTA Protein

NCBI FieldGuide NCBI Toolbox
/************************************************************************
*
* asn2ff.c
* convert an ASN.1 entry to flat file format, using the FFPrintArray.
*
**************************************************************************/
#include
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include
#ifdef ENABLE_ID1
#include
#endif
FILE *fpl;
Args myargs[] = {
{"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
{"Input is a Seq-entry","F", NULL,NULL,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
{"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
{"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
{"Show Sequence?","T", NULL,NULL,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},
Toolbox Sources
ftp> open ftp.ncbi.nih.gov.
ftp> cd toolbox
ftp> cd ncbi_tools
ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools

NCBI FieldGuide Bulk Divisions
Batch Submission and htg ( and ftp); Inaccurate; Poorly Characterized
Expressed Sequence Tag
–1st pass single read cDNA
Genome Survey Sequence
–1st pass single read gDNA
High Throughput Genomic
–incomplete sequences of genomic clones
Sequence Tagged Site
–PCR-based mapping reagents

NCBI FieldGuide EST Division: Expressed Sequence Tags
[Diagram labels: nucleus; 30,000 genes; RNA gene products; make cDNA library; ,000 unique cDNA clones in library; isolate unique clones; sequence once from each end (5’ and 3’)]
>IMAGE: ', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE: ' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
gbdiv_est[Properties]

NCBI FieldGuide ESTs in Entrez
Total 21 million records
Human 5.6 million
Mouse 4.1 million
Rat 0.6 million

NCBI FieldGuide What is UniGene?
A gene-oriented view of sequence entries
–MegaBlast based automated sequence clustering
–Now informed by genome hits (New!)
Nonredundant set of gene oriented clusters
–Each cluster a unique gene
–Information on tissue types and map locations
–Includes well-characterized genes and novel ESTs
–Useful for gene discovery and selection of mapping reagents
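
The slide describes UniGene as automated clustering in which each cluster stands for one gene. To illustrate only the general idea of grouping sequences that are linked by similarity hits, the Python sketch below applies single-linkage clustering with a union-find structure. It is not UniGene's actual procedure (which the slide says is MegaBlast based and informed by genome hits), and the sequence names and hit pairs are hypothetical.

```python
def cluster(ids, hits):
    """Single-linkage clustering: ids joined by any chain of hits share a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical sequences: one mRNA with two ESTs, plus two ESTs that only hit each other.
ids = ["mRNA_1", "est_5p_a", "est_3p_a", "est_x", "est_y"]
hits = [("mRNA_1", "est_5p_a"), ("mRNA_1", "est_3p_a"), ("est_x", "est_y")]

clusters = cluster(ids, hits)
print(clusters)
```

In this toy input the second cluster contains only ESTs, which corresponds to the "novel ESTs" case mentioned on the slide.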

NCBI FieldGuide EST hits A.t. serine protease mRNA A.t. mRNA 5’ EST hits 3’ EST hits

NCBI FieldGuide UniGene June 2004

NCBI FieldGuide Human UniGene
Sequences in clusters
136,416 mRNAs
5,187 models
7,415 HTC
1,416,936 EST, 3' reads
2,092,475 EST, 5' reads
+ 775,747 EST, other/unknown
,434,176 total sequences in clusters
Final Number of Clusters (sets)
total
28,126 contain at least one mRNA
5,750 contain at least one HTC
104,890 contain at least one EST
26,839 contain both mRNAs and ESTs
UniGene Build 170 Apr. 24, ,219
3,000,000,000 bp; 30 K expected genes; 75% excess

NCBI FieldGuide Genome Sequencing - HTG, GSS, (WGS)
[Diagram labels: whole BAC insert (or genome); shredding; cloning; isolating; sequencing; assembly; GSS division or trace archive; Draft Sequence (HTG division); whole genome shotgun assemblies (traditional division)]

NCBI FieldGuide HTG Division: Honeybee Draft Sequences Unfinished sequences of BACs Gaps and unordered pieces Finished sequences move to traditional GenBank division

NCBI FieldGuide Maize Genome Survey Sequences Surveys of BAC Libraries BAC end sequences More than 100K per project

NCBI FieldGuide Whole Genome Shotgun Projects
Traditional GenBank Divisions
projects
–Virus
–Bacteria
–Environmental sequences
–Archaea
–44 Eukaryotes featuring:
Chicken, Rat, Mouse, Dog, Chimpanzee, Human
Pufferfish (2)
Honeybee, Anopheles, Fruit Flies (2), Silkworm
Nematode (C. briggsae)
Yeasts (8), Aspergillus (2)
Rice

NCBI FieldGuide Whole Genome Shotgun Projects wgs[Properties]

NCBI FieldGuide Derivative Sequence Databases RefSeq TPA

NCBI FieldGuide NCBI Derivative Sequence Data
[Diagram of sequence fragments; labels: Labs, GenBank, Curators, Algorithms, UniGene, RefSeq, Genome Assembly]

NCBI FieldGuide RefSeq: NCBI’s Derivative Sequence Database
Curated transcripts and proteins
–reviewed
–human, mouse, rat, fruit fly, zebrafish, arabidopsis
–microbial genomes (proteins)
Model transcripts and proteins
Assembled Genomic Regions (contigs)
–human genome
–mouse genome
Chromosome records
–human genome
–microbial
–organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
srcdb_refseq[Properties]

NCBI FieldGuide RefSeq Benefits
–non-redundancy
–explicitly linked nucleotide and protein sequences
–updates to reflect current sequence data and biology
–data validation
–format consistency
–distinct accession series
–stewardship by NCBI staff and collaborators

NCBI FieldGuide RefSeq Accession Numbers
mRNAs and Proteins
NM_ Curated mRNA
NP_ Curated Protein
NR_ Curated non-coding RNA
XM_ Predicted mRNA
XP_ Predicted Protein
XR_ Predicted non-coding RNA
Gene Records
NG_ Reference Genomic Sequence
Chromosome
NC_ Microbial replicons, organelle genomes, human chromosomes
Assemblies
NT_ Contig
NW_ WGS Supercontig
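
Because each RefSeq record type has its own accession prefix, the type can be read directly from the first three characters of an accession. The Python sketch below encodes the table from this slide as a lookup; the function name refseq_type is a hypothetical helper, and NM_000249 is the accession used later in this presentation.

```python
# Prefix table copied from the slide.
REFSEQ_PREFIXES = {
    "NM_": "Curated mRNA",
    "NP_": "Curated Protein",
    "NR_": "Curated non-coding RNA",
    "XM_": "Predicted mRNA",
    "XP_": "Predicted Protein",
    "XR_": "Predicted non-coding RNA",
    "NG_": "Reference Genomic Sequence",
    "NC_": "Microbial replicons, organelle genomes, human chromosomes",
    "NT_": "Contig",
    "NW_": "WGS Supercontig",
}


def refseq_type(accession):
    """Return the record type for a RefSeq accession, or None if the prefix is unknown."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


print(refseq_type("NM_000249"))
```

An accession without one of these prefixes, such as a GenBank accession, returns None.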

NCBI FieldGuide Third Party Annotation (TPA) Database
Annotations of existing GenBank sequences
Allows for community annotation of genomes
Direct submissions
–BankIt
–Sequin
tpa[Properties]

NCBI FieldGuide TPA record: WGS Assembly CDS Feature TPA protein

NCBI FieldGuide Human Nucleotide Sequences
ISDC (GenBank/EMBL/DDBJ) 8,480,614
PRI 901,843 (WGS 601,710)
EST 5,639,828
GSS 905,594
HTG 18,367
STS 76,333
PAT 930,108
RefSeq 29,935
TPA 864

NCBI FieldGuide Other NCBI Databases
dbSNP: nucleotide polymorphism
GEO: Gene Expression Omnibus
–microarray and other expression data
Gene: gene records
–Unifies LocusLink and Microbial Genomes
Structure: imported structures (PDB)
–Cn3D viewer, NCBI curation
CDD: conserved domain database
–Protein families (COGs)
–Single domains (PFAM, SMART, CD)

NCBI FieldGuide NCBI’s SNP Database
Primary Database and Derivative (RefSNP)
–Single Nucleotide Polymorphisms
–Repeat polymorphisms
–Insertion-Deletion Polymorphisms
24 Species
Over 15 million submissions

NCBI FieldGuide Submitted SNP: Hemochromatosis SNP

NCBI FieldGuide Non-redundant Computational Analysis BLAST hits to genome, mRNA, protein and structure RefSNP

NCBI FieldGuide Gene Expression Omnibus
Expression Data Repository
–microarray (protein, nucleotide)
–SAGE
–Mass Spec

NCBI FieldGuide Expression Platform / Series

NCBI FieldGuide Analysis: GEO DataSets ADH2 time course

NCBI FieldGuide NCBI Structures and Domains

NCBI FieldGuide MMDB: Molecular Modeling Data Base
Derived from experimentally determined PDB records
Value added to PDB records including:
–Addition of explicit chemical graph information
–Validation
–Inclusion of Taxonomy, Citation
–Conversion to ASN.1 data description language
Structure neighbors determined by Vector Alignment Search Tool (VAST)

NCBI FieldGuide Structure Summary Cn3D viewer Conserved Domains 3D Domain Neighbors Structure Neighbors

NCBI FieldGuide Cn3D 4.1: C-Src

NCBI FieldGuide VAST: Structure Neighbors
Vector Alignment Search Tool
–For each protein chain, locate SSEs (secondary structure elements) and represent them as individual vectors
–Align the vectors
Example: Human IL-4; IL-4 & Leptin

NCBI FieldGuide Cn3D 4.1: Structural Alignment Casein kinase S. pombe Src Kinase H. sapiens Conserved ATP binding site

NCBI FieldGuide Cn3D: Simple Homology Modeling human swordtail

NCBI FieldGuide NCBI’s Conserved Domain Database
Multiple sequence alignments
PSI-BLAST-based score matrices
Sources: SMART, PFAM, COGs, new NCBI curated domains
–structure informed alignments
Stats:
–COGs 4,873
–Pfam 5,193
–SMART 653
–NCBI CDD 316

NCBI FieldGuide NCBI CD: Tyrosine Kinase

NCBI FieldGuide Using Cn3D to model domains

NCBI FieldGuide Using Entrez An integrated database search and retrieval system

NCBI FieldGuide WWW Access Entrez & BLAST

NCBI FieldGuide Genomes Taxonomy Entrez: Database Integration PubMed abstracts Nucleotide sequences Protein sequences 3-D Structure Word weight VAST BLAST Phylogeny

NCBI FieldGuide The (ever expanding) Entrez System Entrez PopSet Structure PubMed Books 3D Domains Taxonomy GEO/GDS UniGene Nucleotide Protein Genome OMIM CDD/CDART Journals SNP UniSTS PubMed Central

NCBI FieldGuide Database Searching with Entrez
–Using limits and field restriction to find human MutL homolog
–Linking and neighboring with MutL
–Mapping SNPs onto structure and the genome

NCBI FieldGuide Global Entrez Search

NCBI FieldGuide Document Summaries: MutL[All Fields]

NCBI FieldGuide Entrez Nucleotides: Limits & Preview/Index Tabs

NCBI FieldGuide Entrez Nucleotides: Limits (search term: MutL)
Field Restriction: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume
Exclude bulk sequences

NCBI FieldGuide Entrez Nucleotides: Limits (search term: MutL)
The Title field corresponds to the DEFINITION line of the record
Exclude Bulk Sequences
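
The same restrictions can be typed directly as a query, using term[Field] tags and Boolean operators. The Python sketch below assembles such a query for the human MutL example. Two points are assumptions rather than anything shown on the slides: the helper names are hypothetical, and expressing "Exclude bulk sequences" as a chain of NOT gbdiv_xxx[Properties] terms over these five divisions is one plausible reading of that checkbox.

```python
def fielded(term, field):
    """Restrict a search term to one Entrez field, e.g. MutL[Title]."""
    return f"{term}[{field}]"


def exclude_divisions(query, divisions):
    """Append NOT terms that drop the given GenBank divisions (assumed equivalent of the checkbox)."""
    for code in divisions:
        query += f" NOT gbdiv_{code.lower()}[Properties]"
    return query


# Field names Title and Organism come from the Limits field list on the slides.
query = fielded("MutL", "Title") + " AND " + fielded("human", "Organism")
query = exclude_divisions(query, ["EST", "GSS", "STS", "HTG", "PAT"])

print(query)
```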

NCBI FieldGuide Document Summaries: Limits

NCBI FieldGuide Adding Terms: Preview/Index
Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume

NCBI FieldGuide Human MutL Search Results

NCBI FieldGuide Human MutL RefSeq GenBank Records

NCBI FieldGuide NM_000249: Links

NCBI FieldGuide Literature Links PubMed OMIM

NCBI FieldGuide NM_000249: PubMed Books

NCBI FieldGuide Books Link

NCBI FieldGuide OMIM: Human Disease Genes Conserved Domain

NCBI FieldGuide Sequence Links: Nucleotide, Protein

NCBI FieldGuide NM_000249: Related Sequences similarity Original GenBank mRNAs Original GenBank genomic Genome Project BAC

NCBI FieldGuide Taxonomy Link The Tax Browser NCBI’s Taxonomy

NCBI FieldGuide Taxonomy Link

NCBI FieldGuide The Tax Browser Nucleotide Protein Structures Popset

NCBI FieldGuide Marsupial PopSets

NCBI FieldGuide Mammalian Phylogenetic Study

NCBI FieldGuide Batch Downloads

NCBI FieldGuide Batch Downloads: FASTA and GI list

NCBI FieldGuide Batch Entrez / Entrez-utilities
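
For scripted batch retrieval, the Entrez utilities are called through URLs. The Python sketch below only builds the request URLs and does not contact the server. The base URL is the current E-utilities address, which is an assumption here because the slide does not give one; the function names are hypothetical, while db, term, retmax, id, rettype and retmode are standard E-utilities parameters.

```python
from urllib.parse import urlencode

# Assumed base URL (current E-utilities address; not shown on the slide).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that returns record IDs matching an Entrez query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, ids, rettype="fasta", retmode="text"):
    """Build an EFetch URL that downloads the listed records in one request."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


search = esearch_url("nucleotide", "MutL[Title]")
fetch = efetch_url("nucleotide", ["NM_000249"])
print(search)
print(fetch)
```

Passing several identifiers in one efetch_url call mirrors the batch download of a GI list shown on the preceding slides.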

NCBI FieldGuide NCBI Protein Databases
GenPept: GenBank, EMBL, DDBJ CDS translations
RefSeq: mRNA based (NP_) and genome based (XP_)
Swiss-Prot: curated high quality protein reviews
PIR: Protein Information Resource, Georgetown University
PRF: Protein Research Foundation
PDB: Protein Data Bank, sequences from structures

NCBI FieldGuide Protein Link BLAST Link Conserved Domains

NCBI FieldGuide Related Proteins: Redundancy Redundant Sequences

NCBI FieldGuide Sequence from MutL structure Related Proteins: Links

NCBI FieldGuide BLink: non-redundant relatives Arabidopsis homolog Conserved Domain

NCBI FieldGuide MLH1 Domain Structure: CDD ATPase Domain Mismatch Repair Domain

NCBI FieldGuide MLH1: ATPase Domain

NCBI FieldGuide ATPase structural alignment ATP Binding site helix

NCBI FieldGuide Variations Human MLH1

NCBI FieldGuide BLink Finding structural models

NCBI FieldGuide Mapping Variation Onto Structure Bacterial DNA mismatch repair proteins Loads sequence alignment and structure in Cn3D

NCBI FieldGuide Mapping Variation Onto Structure Conserved Asn Asn Ile Ile – Val

NCBI FieldGuide Genome Resources

NCBI FieldGuide NM_000249: Genome Links

NCBI FieldGuide Higher Genome Resources

NCBI FieldGuide MLH1: UniGene Cluster

NCBI FieldGuide ESTs in UniGene

NCBI FieldGuide The Map Viewer Genome BLAST page

NCBI FieldGuide Map Viewer: Human MLH1 Customizable Annotation NCBI Assembly EST Hits Gene Annotations Models

NCBI FieldGuide Maps and Options

NCBI FieldGuide Synteny: Mammalian Genomes

NCBI FieldGuide The New Homologene
–No longer UniGene based
–Protein similarities first
–Guided by taxonomic tree
–Includes orthologs and paralogs
[Diagram: an early globin gene undergoes gene duplication into an A-chain gene and a B-chain gene. Frog A, chick A and mouse A are orthologs, as are frog B, chick B and mouse B; the A and B genes are paralogs.]

NCBI FieldGuide The New Homologene

NCBI FieldGuide C.elegans homolog

NCBI FieldGuide C.elegans homolog

NCBI FieldGuide STOPPED HERE

NCBI FieldGuide Microbial Genomes

NCBI FieldGuide Related Proteins->Genome Links

NCBI FieldGuide Genome Links

NCBI FieldGuide Microbial Genomes

NCBI FieldGuide COGs Analysis E. coli K12 Genome

NCBI FieldGuide Entrez Genes: integrated gene-based access LocusLink Complete Genomes eukaryotic microbial organelle

NCBI FieldGuide Genes MLH1: Central Resource

NCBI FieldGuide Gene Table

NCBI FieldGuide Gene:Links

NCBI FieldGuide Outside Views