NCBI FieldGuide September 29, 2004 ICGEB NCBI Molecular Biology Resources A Field Guide part 1.

1 NCBI FieldGuide September 29, 2004 ICGEB NCBI Molecular Biology Resources A Field Guide part 1

2 NCBI FieldGuide About NCBI The NCBI Entrez System NCBI Sequence Databases NCBI Genomic Resources ** Intermission ** NCBI Precomputed Resources –Behind the scenes NCBI Resources

3 NCBI FieldGuide The National Institutes of Health Bethesda, MD

4 NCBI FieldGuide The National Center for Biotechnology Information Created as a part of NLM in 1988 –Establish public databases –Perform research in computational biology –Develop software tools for sequence analysis –Disseminate biomedical information

5 NCBI FieldGuide Number of Users and Hits Per Day [chart: 1997 to 2003; annotated at Christmas & New Year’s Days]. Currently averaging 10,000,000 to 35,000,000 hits per day!

6 NCBI FieldGuide Countries of Origin

7 NCBI FieldGuide Web Access: http://www.ncbi.nlm.nih.gov

8 NCBI FieldGuide http://www.ncbi.nlm.nih.gov/About/index.html

13 A part of the NCBI Bookshelf. Part 1. The Databases; Part 2. Data Flow and Processing; Part 3. Querying and Linking the Data; Part 4. User Support

17 OMIM - A catalogue of genes involved with human disease processes - Detailed clinical and reference information - Curated and maintained by Johns Hopkins - Links to PubMed and sequence databases

18 NCBI FieldGuide The Entrez System. Entrez databases: Nucleotide, PubMed, Protein, Taxonomy, Structure, Domains, 3D Domains, Journals, PMC, OMIM, Books, PopSet, SNP, UniGene, UniSTS, Genome, Gene, GEO, GEO Datasets, MeSH, CancerChromosomes, HomoloGene

19 NCBI FieldGuide Taxonomy

20 NCBI FieldGuide zebrafish

22 NCBI FieldGuide The Global Entrez search engine

25 Types of Databases. Primary Databases –Original submissions by experimentalists –Database staff review and may organize the data, but do not add or modify information –Records are “owned” and updated by their authors. Examples: GenBank, SNP, GEO. Derivative Databases –Human-curated (compilation and correction of data). Examples: Gene (LocusLink), Structure & Literature databases –Computationally derived. Example: UniGene –Combination. Examples: RefSeq, Genome Assembly, Domain databases

26 NCBI FieldGuide Primary vs. Derivative Sequence Databases. [Diagram] Primary: Sequencing Centers and Labs submit sequences to GenBank, which is updated ONLY by submitters. Derivative: Algorithms and Curators build UniGene, RefSeq and Genome Assembly from those records; these are updated continually by NCBI.

27 NCBI FieldGuide How to Query a Particular Database: (term1[tag delimiter] op term2[tag delimiter] op …). tag delimiter = Entrez indexing field; op = AND, OR, NOT. Boolean operators MUST be in ALL CAPS! Examples of tag delimiters: Organism, Journal, User compounds, Author.
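The query pattern on this slide can be sketched in Python as follows. This is a minimal illustration, not NCBI code: the helper names `tag` and `combine` are hypothetical, and the example terms are borrowed from the field-limit queries shown later in the deck.

```python
# Hypothetical helpers for composing Entrez-style query strings.
BOOLEAN_OPS = ("AND", "OR", "NOT")


def tag(term: str, field: str) -> str:
    """Attach a field tag to a term, e.g. tag("human", "orgn") gives human[orgn]."""
    return f"{term}[{field}]"


def combine(op: str, left: str, right: str) -> str:
    """Join two query parts with a Boolean operator, forcing it to ALL CAPS."""
    op = op.upper()
    if op not in BOOLEAN_OPS:
        raise ValueError(f"unsupported operator: {op}")
    return f"({left} {op} {right})"


query = combine("and", tag("thyroid peroxidase", "title"), tag("human", "orgn"))
print(query)  # (thyroid peroxidase[title] AND human[orgn])
```

Upper-casing the operator inside `combine` mirrors the slide's warning that Boolean operators must be in ALL CAPS.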

28 NCBI FieldGuide Sample Query Brauninger a c-src kinase Organism Journal User compounds Author

29 NCBI FieldGuide Using Fields to Find Records. Search fields: Accession, All Fields, Author, EC/RN Number, Feature Key, Filter, Gene Name, Issue, Journal, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Volume. Most useful search field, [Organism]: –human[orgn] …or… bacteria[orgn]. Useful search terms in the [Properties] field: –srcdb: “source database” (srcdb_genbank[prop]) –gbdiv: “GenBank division” (gbdiv_est[prop]) –biomol: “biomolecular type” (biomol_mrna[prop])

30 NCBI FieldGuide Using Field Limits
#1: thyroid peroxidase (335 results)
#2: thyroid peroxidase AND human[orgn] (291 results)
#3: thyroid peroxidase[title] AND human[orgn] (166 results)
#4: #3 AND srcdb_refseq[prop] (5 results)
#5: #3 AND srcdb_ddbj/embl/genbank[prop] (161 results)
#6: #5 AND gbdiv_est[prop] (20 results)
#7: #5 AND gbdiv_pri[prop] (141 results)
#8: #7 AND biomol_genomic[prop] (25 results)
#9: #7 AND biomol_mrna[prop] (116 results)
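The numbered searches above build on one another through `#N` references. A small sketch, assuming the history is held in a plain dictionary (the names `HISTORY`, `expand` and `count` are hypothetical), shows how such a reference unfolds into a full query and how the counts on the slide add up at each split. Property terms are written with underscores here, as in `gbdiv_est[Properties]` on the EST slide.

```python
import re

# Search history transcribed from the slide: number -> (query, result count).
HISTORY = {
    1: ("thyroid peroxidase", 335),
    2: ("thyroid peroxidase AND human[orgn]", 291),
    3: ("thyroid peroxidase[title] AND human[orgn]", 166),
    4: ("#3 AND srcdb_refseq[prop]", 5),
    5: ("#3 AND srcdb_ddbj/embl/genbank[prop]", 161),
    6: ("#5 AND gbdiv_est[prop]", 20),
    7: ("#5 AND gbdiv_pri[prop]", 141),
    8: ("#7 AND biomol_genomic[prop]", 25),
    9: ("#7 AND biomol_mrna[prop]", 116),
}


def expand(number: int) -> str:
    """Recursively replace each #N reference with the parenthesized query N."""
    query, _count = HISTORY[number]
    return re.sub(r"#(\d+)", lambda m: f"({expand(int(m.group(1)))})", query)


def count(number: int) -> int:
    """Return the result count recorded for search N."""
    return HISTORY[number][1]


print(expand(9))

# On this slide each pair of limits splits its parent set exactly.
assert count(4) + count(5) == count(3)  # RefSeq vs. DDBJ/EMBL/GenBank
assert count(6) + count(7) == count(5)  # EST vs. primate division
assert count(8) + count(9) == count(7)  # genomic vs. mRNA
```

The additive counts are a property of this particular example; limits in general need not partition a result set.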

31 NCBI FieldGuide Complex searches you can do with Preview/Index. How many rat UniGene clusters contain at least one mRNA? 1) Select the UniGene database. 2) Find all the rat records: rat[organism]. 3) Find those that have ≥ 1 mRNAs (“not 0”), using NOT. Terms used (and indexed) in Entrez fields can be searched to gain useful information!

32 NCBI FieldGuide Complex Queries with Preview/Index: NOT 0[mRNA Count]

33 NCBI FieldGuide Primary (1º) Sequence Database: GenBank. Nucleotide-only sequence database. Archival in nature. Submission of GenBank Data to NCBI: –Direct submissions of individual records via Web (BankIt, Sequin) –Batch submissions of bulk sequences via email (EST, GSS, STS) –FTP accounts for Sequencing Centers

34 NCBI FieldGuide GenBank growth [chart: sequence records (millions) and total base pairs (billions), ’83 to ’04]. Release 143: 37.3 million records, 41.8 billion nucleotides. Average doubling time ≈ 14 months.

35 NCBI FieldGuide GenBank Release 143, August 2004: 37,343,937 Records; 41,808,045,653 Nucleotides; >170,000 Species; 160 Gigabytes; 657 files. Full release every two months; incremental and cumulative updates daily; available only through the Internet: ftp://ftp.ncbi.nih.gov/genbank/
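The release figures support two quick calculations: the mean record length, and what a constant 14-month doubling time would imply. The sketch below is illustrative arithmetic only; the function names are hypothetical, and the projection assumes the doubling time stays fixed, which the slides do not claim.

```python
# Figures quoted on the slides for GenBank Release 143 (August 2004).
RELEASE_143_RECORDS = 37_343_937
RELEASE_143_BASES = 41_808_045_653
DOUBLING_MONTHS = 14.0  # approximate average doubling time from the growth chart


def mean_record_length(bases: int, records: int) -> float:
    """Average number of nucleotides per sequence record."""
    return bases / records


def projected_size(current: float, months_ahead: float,
                   doubling_months: float = DOUBLING_MONTHS) -> float:
    """Extrapolate a size under exponential growth with a fixed doubling time."""
    return current * 2 ** (months_ahead / doubling_months)


avg = mean_record_length(RELEASE_143_BASES, RELEASE_143_RECORDS)
print(f"mean record length: {avg:.1f} nucleotides")

# One doubling period ahead simply doubles the current size.
print(projected_size(41.8, 14))  # size in billions of nucleotides
```

The short mean length is consistent with a database dominated by the bulk divisions (EST, GSS), whose records are single sequencing reads.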

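36 NCBI FieldGuide The International Sequence Database Collaboration. [Diagram] GenBank (NCBI, NIH; retrieval through Entrez), EMBL (EBI; retrieval through SRS) and DDBJ (NIG, CIB; retrieval through getentry) each receive submissions and updates and exchange them with the other two. Submission routes shown for GenBank: Sequin, BankIt, ftp.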

37 NCBI FieldGuide Organization of GenBank: GenBank Divisions (gbdiv). Records are divided into 17 Divisions: 1 Patent (11 files), 5 High Throughput, 11 Traditional.
BULK Divisions: Batch Submission (Email and FTP); Inaccurate; Poorly characterized.
EST (335) Expressed Sequence Tag; GSS (116) Genome Survey Sequence; HTG (61) High Throughput Genomic; STS (5) Sequence Tagged Site; HTC (6) High Throughput cDNA
Traditional Divisions: Direct Submissions (Sequin and BankIt); Accurate; Well characterized.
PRI (28) Primate; PLN (12) Plant and Fungal; BCT (10) Bacterial and Archaeal; INV (6) Invertebrate; ROD (13) Rodent; VRL (3) Viral; VRT (7) Other Vertebrate; MAM (1) Mammalian (ex. ROD and PRI); PHG (1) Phage; SYN (1) Synthetic (cloning vectors); UNA (1) Unannotated
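The division scheme can be captured in a small lookup table. The sketch below is only an illustration: the dictionary and function names are hypothetical, and the patent division code `PAT` does not appear on the slide (it is the standard GenBank code for that division).

```python
# Division codes and names as listed on the slide.
BULK_DIVISIONS = {
    "EST": "Expressed Sequence Tag",
    "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic",
    "STS": "Sequence Tagged Site",
    "HTC": "High Throughput cDNA",
}
TRADITIONAL_DIVISIONS = {
    "PRI": "Primate",
    "PLN": "Plant and Fungal",
    "BCT": "Bacterial and Archaeal",
    "INV": "Invertebrate",
    "ROD": "Rodent",
    "VRL": "Viral",
    "VRT": "Other Vertebrate",
    "MAM": "Mammalian (ex. ROD and PRI)",
    "PHG": "Phage",
    "SYN": "Synthetic (cloning vectors)",
    "UNA": "Unannotated",
}
PATENT_DIVISIONS = {"PAT": "Patent"}  # code not shown on the slide


def division_class(code: str) -> str:
    """Classify a division code as 'bulk', 'traditional' or 'patent'."""
    code = code.upper()
    if code in BULK_DIVISIONS:
        return "bulk"
    if code in TRADITIONAL_DIVISIONS:
        return "traditional"
    if code in PATENT_DIVISIONS:
        return "patent"
    raise KeyError(f"unknown GenBank division: {code}")


def gbdiv_term(code: str) -> str:
    """Build the [Properties] search term for a division, e.g. gbdiv_est[prop]."""
    division_class(code)  # raises KeyError for unknown codes
    return f"gbdiv_{code.lower()}[prop]"


total = len(BULK_DIVISIONS) + len(TRADITIONAL_DIVISIONS) + len(PATENT_DIVISIONS)
print(total, division_class("htg"), gbdiv_term("PRI"))
```

The three tables together give the 17 divisions stated on the slide (5 high throughput, 11 traditional, 1 patent).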

38 NCBI FieldGuide File Formats of the Sequence Databases Each sequence is represented by a text record called a flat file. GenBank/GenPept (useful for scientists) FASTA (the simplest format) ASN.1 & XML (useful for programmers)

39 NCBI FieldGuide A Traditional “GenBank” Record
LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.
Callouts: LOCUS line: Accession Number, Length, molecule type (mRNA = cDNA; DNA = genomic), Division, Date of most recent modification. DEFINITION = Title. VERSION line: Accession.Version and GI Number. ORGANISM: NCBI’s Taxonomy. REFERENCE blocks: References.
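The callouts on this slide map directly onto whitespace-separated fields of the LOCUS and VERSION lines. A minimal parsing sketch, assuming exactly the seven-field LOCUS layout shown here (records in other layouts, for example with a topology field, would need more handling); the function names are hypothetical:

```python
from typing import Dict, Union

LOCUS_LINE = "LOCUS AF062069 3808 bp mRNA INV 02-MAR-2000"
VERSION_LINE = "VERSION AF062069.2 GI:7144484"


def parse_locus(line: str) -> Dict[str, Union[str, int]]:
    """Split a LOCUS line into the fields labelled on the slide."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    return {
        "accession": fields[1],   # locus name; equals the accession here
        "length": int(fields[2]),
        "molecule": fields[4],    # mRNA = cDNA, DNA = genomic
        "division": fields[5],
        "modified": fields[6],    # date of most recent modification
    }


def parse_version(line: str) -> Dict[str, Union[str, int]]:
    """Split a VERSION line into accession, version number and GI number."""
    keyword, accession_version, gi_field = line.split()
    if keyword != "VERSION" or not gi_field.startswith("GI:"):
        raise ValueError(f"unexpected VERSION layout: {line!r}")
    accession, version = accession_version.split(".")
    return {"accession": accession, "version": int(version),
            "gi": int(gi_field[len("GI:"):])}


print(parse_locus(LOCUS_LINE))
print(parse_version(VERSION_LINE))
```

Because `str.split()` collapses runs of whitespace, the same code works whether the line is column-aligned or flattened to single spaces.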

40 NCBI FieldGuide Lower down in the GenBank Record
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT      201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//
Callouts: Feature Table. /protein_id="AAC16332.2": GenPept Protein ID. /db_xref="GI:7144485".
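Feature locations such as `258..3302` are 1-based and inclusive, so a span's length is end minus start plus one. The sketch below applies that to the CDS on this slide; the function names are hypothetical, and the residue count assumes the final codon of the CDS is a stop codon, which the truncated translation on the slide cannot confirm.

```python
from typing import Tuple


def parse_span(location: str) -> Tuple[int, int]:
    """Parse a simple 'start..end' feature location (no joins or complements)."""
    start, end = location.split("..")
    return int(start), int(end)


def span_length(location: str) -> int:
    """Length of a 1-based, inclusive span."""
    start, end = parse_span(location)
    return end - start + 1


cds_nt = span_length("258..3302")
codons, remainder = divmod(cds_nt, 3)
# Assumption: the last codon is the stop codon and is not translated.
protein_residues = codons - 1

print(cds_nt, codons, remainder, protein_residues)
```

A zero remainder is a useful sanity check that a complete CDS with `/codon_start=1` covers a whole number of codons.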

41 NCBI FieldGuide FASTA format
>gi|4680721|gb|AAA61217.2| thyroid peroxidase [Homo sapiens]
MRALAVLSVTLVMACTEAFFPFISRGKELLWGKPEESRVSSVLEESKRLVDTAMYATMQRNLKKRGILSG
AQLLSFSKLPEPTSGVIARAAEIMETSIQAMKRKVNLKTQQSQHPTDALSEDLLSIIANMSGCLPYMLPP
KCPNTCLANKYRPITGACNNRDHPRWGASNTALARWLPPVYEDGFSQPRGWNPGFLYNGFPLPPVREVTR
HVIQVSNEVVTDDDRYSDLLMAWGQYIDHDIAFTPQSTSKAAFGGGSDCQMTCENQNPCFPIQLPEEARP
AAGTACLPFYRSSAACGTGDQGALFGNLSTANPRQQMNGLTSFLDASTVYGSSPALERQLRNWTSAEGLL
RVHGRLRDSGRAYLPFVPPRAPAACAPEPGNPGETRGPCFLAGDGRASEVPSLTALHTLWLREHNRLAAA
LKALNAHWSADAVYQEARKVVGALHQIITLRDYIPRILGPEAFQQYVGPYEGYDSTANPTVSNVFSTAAF
RFGHATIHPLVRRLDASFQEHPDLPGLWLHQAFFSPWTLLRGGGLDPLIRGLLARPAKLQVQDQLMNEEL
TERLFVLSNSSTLDLASINLQRGRDHGLPGYNEWREFCGLPRLETPADLSTAIASRSVADKILDLYKHPD
NIDVWLGGLAENFLPRARTGPLFACLIGKQMKALRDGDWFWWENSHVFTDAQRRELEKHSLSRVICDNTG
LTRVPMDAFQVGKFPEDFESCDSITGMNLEAWRETFPQDDKCGFPESVENGDFVHCEESGRRVLVYSCRH
GYELQGREQLTCTQEGWDFQPPLCKDVNECADGAHPPCHASARCRNTKGGFQCLCADPYELGDDGRTCVD...
>gi|4680720|gb|M17755.2|HUMTPOC Homo sapiens thyroid peroxidase (TPO) mRNA, complete cds
GAGGCAATTGAGGCGCCCATTTCAGAAGAGTTACAGCCGTGAAAATTACTCAGCAGTGCAGTTGGCTGAG
AAGAGGAAAAAAGAATGAGAGCGCTGGCTGTGCTGTCTGTCACGCTGGTTATGGCCTGCACAGAAGCCTT
CTTCCCCTTCATCTCGAGAGGGAAAGAACTCCTTTGGGGAAAGCCTGAGGAGTCTCGTGTCTCTAGCGTC
TTGGAGGAAAGCAAGCGCCTGGTGGACACCGCCATGTACGCCACGATGCAGAGAAACCTCAAGAAAAGAG
GAATCCTTTCTGGAGCTCAGCTTCTGTCTTTTTCCAAACTTCCTGAGCCAACAAGCGGAGTGATTGCCCG
AGCAGCAGAGATAATGGAAACATCAATACAAGCGATGAAAAGAAAAGTCAACCTGAAAACTCAACAATCA
CAGCATCCAACGGATGCTTTATCAGAAGATCTGCTGAGCATCATTGCAAACATGTCTGGATGTCTCCCTT
ACATGCTGCCCCCAAAATGCCCAAACACTTGCCTGGCGAACAAATACAGGCCCATCACAGGAGCTTGCAA
CAACAGAGACCACCCCAGATGGGGCGCCTCCAACACGGCCCTGGCACGATGGCTCCCTCCAGTCTATGAG
GACGGCTTCAGTCAGCCCCGAGGCTGGAACCCCGGCTTCTTGTACAACGGGTTCCCACTGCCCCCGGTCC
GGGAGGTGACAAGACATGTCATTCAAGTTTCAAATGAGGTTGTCACAGATGATGACCGCTATTCTGACCT
CCTGATGGCATGGGGACAATACATCGACCACGACATCGCGTTCACACCACAGAGCACCAGCAAAGCTGCC...
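FASTA is simple enough to read with a few lines of code: a `>` line starts a record and every following line up to the next `>` is sequence. The sketch below is a minimal reader plus a parser for the `gi|...|gb|...|` definition lines shown on this slide. The function names are hypothetical, the example is abbreviated to the first two sequence lines of the protein record, and the definition-line parser assumes exactly this `gi|number|db|accession.version|` layout.

```python
from typing import Dict, Iterator, Tuple, Union

EXAMPLE = """\
>gi|4680721|gb|AAA61217.2| thyroid peroxidase [Homo sapiens]
MRALAVLSVTLVMACTEAFFPFISRGKELLWGKPEESRVSSVLEESKRLVDTAMYATMQRNLKKRGILSG
AQLLSFSKLPEPTSGVIARAAEIMETSIQAMKRKVNLKTQQSQHPTDALSEDLLSIIANMSGCLPYMLPP
"""


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (definition line without '>', concatenated sequence) pairs."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_defline(header: str) -> Dict[str, Union[str, int]]:
    """Split a 'gi|number|db|accession.version|rest' definition line."""
    parts = header.split("|", 4)
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError(f"unexpected definition line: {header!r}")
    return {"gi": int(parts[1]), "db": parts[2],
            "accession_version": parts[3], "description": parts[4].strip()}


records = list(read_fasta(EXAMPLE))
header, sequence = records[0]
print(parse_defline(header))
print(len(sequence))
```

For the nucleotide record on the slide, the text after the fourth `|` begins with the locus name (`HUMTPOC`) followed by the description, so `description` would contain both.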

42 NCBI FieldGuide Abstract Syntax Notation: ASN.1
Seq-entry ::= set {
  level 1,
  class nuc-prot,
  descr {
    title "Human thyroid peroxidase mRNA, partial cds., and translated products",
    source {
      org {
        taxname "Homo sapiens",
        common "human",
        db {
          { db "taxon", tag id 9606 } },
        orgname {
          name binomial { genus "Homo", species "sapiens" },
          lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo",
Diagram labels: ASN.1, GenBank, GenPept, FASTA Nucleotide, FASTA Protein

43 NCBI FieldGuide Bulk Divisions: Batch Submission and htg (email and ftp); Inaccurate; Poorly Characterized. Expressed Sequence Tag –1st-pass single-read cDNA. Genome Survey Sequence –1st-pass single-read gDNA. High Throughput Genomic –incomplete sequences of genomic clones. Sequence Tagged Site –PCR-based mapping reagents.

44 NCBI FieldGuide EST Division: Expressed Sequence Tags. Search term: gbdiv_est[Properties]
[Diagram] nucleus: 30,000 genes; RNA gene products; make cDNA library: 80-100,000 unique cDNA clones in library; isolate unique clones; sequence once from each end (5’ and 3’)
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

45 NCBI FieldGuide Genome Sequencing - HTG, GSS, (WGS). [Diagram] Whole BAC insert (or genome): shredding, cloning, isolating, sequencing, assembly. Labels: GSS division or trace archive; Draft Sequence (HTG division); whole genome shotgun assemblies (traditional division).

46 NCBI FieldGuide HTG Division: Honeybee Draft Sequences Unfinished sequences of BACs Gaps and unordered pieces Finished sequences move to traditional GenBank division

47 NCBI FieldGuide Other Primary Databases. GEO (Gene Expression Omnibus) –Searchable microarray data repository. SNP (Single Nucleotide Polymorphism) –Allelic variations (including minisatellites/simple sequence repeats and insertions/deletions)

48 NCBI FieldGuide Submit and update data. Query the database: gene identifiers, field information, sequence. Browse datasets. Download data. Redesigned with new features.

49 NCBI FieldGuide GPL: Platform descriptions (Submitted by Manufacturer*). GSM: Raw/processed spot intensities from a single slide/chip (Submitted by Experimentalists). GSE: Grouping of slide/chip data, “a single experiment” (Submitted by Experimentalists). GDS: Grouping of experiments (Curated by NCBI). Search interfaces: Entrez GEO; Entrez GEO Datasets.
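GEO accessions carry their record type in a three-letter prefix, which makes them easy to classify programmatically. The sketch below is an illustration only: the names are hypothetical, and the pairing of each type with who provides it follows one reading of the slide layout (manufacturer for GPL, experimentalists for GSM and GSE, NCBI curation for GDS).

```python
import re
from typing import NamedTuple, Tuple


class GeoRecordType(NamedTuple):
    description: str
    provided_by: str


GEO_TYPES = {
    "GPL": GeoRecordType("platform description", "manufacturer"),
    "GSM": GeoRecordType("raw/processed spot intensities from a single slide/chip",
                         "experimentalists"),
    "GSE": GeoRecordType("grouping of slide/chip data (a single experiment)",
                         "experimentalists"),
    "GDS": GeoRecordType("grouping of experiments", "NCBI curators"),
}

_ACCESSION = re.compile(r"(GPL|GSM|GSE|GDS)(\d+)")


def classify(accession: str) -> Tuple[str, int, GeoRecordType]:
    """Return (prefix, number, record type) for a GEO accession like GDS177."""
    match = _ACCESSION.fullmatch(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognized GEO accession: {accession!r}")
    prefix = match.group(1)
    return prefix, int(match.group(2)), GEO_TYPES[prefix]


for acc in ("GDS177", "GSM827"):
    prefix, number, record_type = classify(acc)
    print(acc, "->", record_type.description, "/", record_type.provided_by)
```

The two example accessions are taken from the GDS177 dataset shown on the following slide.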

50 NCBI FieldGuide GDS177: CMV infection of HFF cells. Comparison of gene expression profiles of HFF cells infected with CMV strains. Platform: FHCRC non-commercial human 18K array. src1: CMV infected fibroblasts; src2: uninfected fibroblasts. Samples: GSM827: FHCMV-T-1; GSM825: FHCMV-T-2; GSM828: FHCMV-T-3; GSM829: FHCMV-H-1; GSM830: FHCMV-H-2; GSM831: FHCMV-H-3; GSM832: CMV_AD169-2; GSM833: CMV_AD169-3. Expression

51 NCBI FieldGuide PRNP

57 SNP - GeneView
