NCBI FieldGuide A Field Guide part 2 February 14, 2006 UT-Health Science Center National Center for Biotechnology Information.

1 NCBI FieldGuide A Field Guide part 2 February 14, 2006 UT-Health Science Center National Center for Biotechnology Information

2 NCBI FieldGuide GenBank Records Header Feature Table Sequence The Flatfile Format

3 NCBI FieldGuide A Typical GenBank Record
  LOCUS       NM_ bp mRNA linear INV 28-OCT-2004
  DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA    (= Title)
  ACCESSION   NM_019570
  VERSION     NM_ GI:
  KEYWORDS    .

4 NCBI FieldGuide GenBank Record: Feature Table

5 NCBI FieldGuide GenBank Record: Feature Table, cont. GenPept identifier

6 NCBI FieldGuide GenBank Record: sequence skip

7 NCBI FieldGuide Indexing for Nucleotide UID
  Field                 Indexed Terms                               Abbreviation
  [primary accession]   NM_                                         [accn]
  [title]               Bos taurus hemochromatosis (hfe), mRNA.
  [organism]            Bos taurus                                  [orgn]
  [sequence length]     1168
  [modification date]   2005/02/19                                  [mdat]
  [properties]          biomol mrna, gbdiv mam, srcdb refseq        [prop]

8 NCBI FieldGuide Global Entrez Search: HFE HFE

9 NCBI FieldGuide Entrez Nucleotide: HFE 137 records Not HFE [Title]

10 NCBI FieldGuide Smarter Query hfe[title] 42 records Curated HFE splice variants (11 total) AND human[orgn]

11 NCBI FieldGuide hfe[title] AND human[orgn] (con’t) Primary data

12 NCBI FieldGuide Preview/Index Gateway to Advanced Searches

13 NCBI FieldGuide Preview/Index

14 NCBI FieldGuide Preview/Index: Properties, srcdb srcdb Properties

15 NCBI FieldGuide Preview/Index: Properties, srcdb …AND srcdb refseq[Properties]

16 NCBI FieldGuide Preview/Index: Properties, srcdb …AND srcdb ddbj/embl/genbank[Properties]

17 NCBI FieldGuide Database Queries
  #1 hfe                                      137
  #2 hfe[title] AND human[orgn]               42
  #3 #2 AND srcdb refseq[prop]                11
  #4 #2 AND srcdb ddbj/embl/genbank[prop]     31
  #5 #4 AND gbdiv pri[prop]                   29
  #6 #4 AND gbdiv est[prop]                   2
  Primate division: gbdiv pri[prop]
  EST division: gbdiv est[prop]
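
The stepwise narrowing on this slide can also be scripted. The following is a minimal sketch, assuming the public E-utilities ESearch endpoint and the database name `nuccore` (neither appears on the slides); the helper names `and_query` and `esearch_url` are hypothetical. It only builds the URLs and does not send any request.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint (an assumption; not named on the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def and_query(*terms):
    """Join Entrez terms with AND, the way the history steps combine them."""
    return " AND ".join(terms)


def esearch_url(term, db="nuccore"):
    """Build (but do not send) an ESearch URL for the given query term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


# Steps #2 to #4 from the slide, each built on the previous one.
base = and_query("hfe[title]", "human[orgn]")
refseq = and_query(base, "srcdb refseq[prop]")
primary = and_query(base, "srcdb ddbj/embl/genbank[prop]")

for label, term in [("#2", base), ("#3", refseq), ("#4", primary)]:
    print(label, esearch_url(term))
```

Keeping each step as a plain string mirrors the Entrez history: a later query is the earlier query plus one more fielded term.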

18 NCBI FieldGuide Molecule Queries
  #1 hfe                              116
  #2 hfe[title] AND human[orgn]       42
  #3 #2 AND biomol mrna[prop]         29
  #4 #2 AND biomol genomic[prop]      13
  Genomic DNA: biomol genomic[prop]
  cDNA: biomol mrna[prop]

19 NCBI FieldGuide More Queries… Fields are database-specific
  Entrez Nucleotide
    Reviewed RefSeqs with transcript variants:
      srcdb refseq reviewed[prop] AND transcript[title] AND variant[title]

20 NCBI FieldGuide More Queries… Fields are database-specific
  Entrez Nucleotide
    Reviewed RefSeqs with transcript variants:
      srcdb refseq reviewed[prop] AND transcript[title] AND variant[title]
    Topoisomerase genes from Archaea:
      topoisomerase[gene name] AND archaea[organism]
  Entrez Gene
    Genes on human chromosome 2 with OMIM links:
      2[chromosome] AND human[organism] AND "gene omim"[filter]
    Membrane proteins linked to cancer:
      "integral to plasma membrane"[gene ontology] AND cancer[dis]

21 NCBI FieldGuide Other Entrez Databases
  UniSTS: markers on the Genethon map of human chromosome 12
    Genethon[Map Name] AND human[organism] AND 12[chromosome]
  UniGene: rat clusters that have at least one mRNA
    rat[organism] NOT 0[mrna count]
  Structure: structures of bacterial kinases with resolutions below 2 Å
    bacteria[organism] AND kinase AND :002.00[resolution]
  SNP: uniquely mapped microsatellites on human chr2
    microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome] AND human[orgn]

22 NCBI FieldGuide Genome Resources UniGene Trace Archive Map Viewer Genomic Biology E-PCR

23 NCBI FieldGuide Genomic Biology

24 NCBI FieldGuide Gen Biol: Gen Resources

25 NCBI FieldGuide Map Viewer – Genome Annotation Updates

26 NCBI FieldGuide Gen Biol: Gen Resources

27 NCBI FieldGuide Genome Projects: microb

28 NCBI FieldGuide Genome Projects: microb 13 Eukaryotic Genome Sequencing Projects Selected: Complete – 0, Assembly – 2, In Progress - 11

29 NCBI FieldGuide Genome Projects: microb 13 Eukaryotic Genome Sequencing Projects Selected: Complete – 0, Assembly – 2, In Progress - 11

30 NCBI FieldGuide Gen Biol: Gen Resources

31 NCBI FieldGuide Gen Biol: Gen Resources

32 NCBI FieldGuide Gen Biol: Gen Resources

33 NCBI FieldGuide Gen Biol: Gen Resources

34 NCBI FieldGuide Gen Biol: Gen Resources

35 NCBI FieldGuide Genome Resources UniGene Trace Archive Map Viewer Genomic Biology E-PCR

36 NCBI FieldGuide UniGene
  Gene-oriented clusters of expressed sequences
  – Automatic clustering using MegaBlast
  – Each cluster represents a unique gene
  – Informed by genome hits
  – Information on tissue types and map locations
  – Useful for gene discovery and selection of mapping reagents

37 NCBI FieldGuide A Cluster of ESTs query 5’ EST hits 3’ EST hits

38 NCBI FieldGuide UniGene Collections

39 NCBI FieldGuide UniGene Collections Species UniGene

40 NCBI FieldGuide UniGene Hs build 188

41 NCBI FieldGuide UniGene Cluster Hs Lipase, hormone-sensitive (LIPE)

42 NCBI FieldGuide UniGene Cluster Hs.95351

43 NCBI FieldGuide UniGene Cluster Hs.95351: expression

44 NCBI FieldGuide UniGene Cluster Hs.95351: seqs

45 NCBI FieldGuide Get Sequences web page ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/

46 NCBI FieldGuide Genome Resources UniGene Trace Archive Map Viewer Genomic Biology E-PCR

47 NCBI FieldGuide E-PCR Genomic sequence here

48 NCBI FieldGuide Options

49 NCBI FieldGuide Results

50 NCBI FieldGuide reverse e-pcr

51 NCBI FieldGuide reverse e-pcr

52 NCBI FieldGuide reverse e-pcr

53 NCBI FieldGuide reverse e-pcr GeneSTS LY6G6D: lymphocyte antigen 6 complex, locus G6D

54 NCBI FieldGuide Genome Resources UniGene Trace Archive Genomic Biology Map Viewer E-PCR

55 NCBI FieldGuide List View

56 NCBI FieldGuide Human MapViewer adar

57 NCBI FieldGuide MapViewer: Human ADAR

58 NCBI FieldGuide MV Hs ADAR 3’ UTR 5’ UTR

59 NCBI FieldGuide Maps & Options
  Sequence maps: Ab initio, Assembly, Repeats, BES_Clone, Clone, NCI_Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, STS, TCAG_RNA, Transcript (RNA), Hs_UniGene, Hs_EST, Mm_UniGene, Mm_EST, Rn_UniGene, Rn_EST, Ssc_UniGene, Ssc_EST, Bt_UniGene, Bt_EST, Gga_UniGene, Gga_EST, Variation (= SNP)
  Cytogenetic maps: Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease
  Genetic maps: deCODE, Genethon, Marshfield
  RH maps: GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC

60 NCBI FieldGuide MapViewer UniGene Component Repeats Gene

61 NCBI FieldGuide Gene PhenotypeVariation

62 NCBI FieldGuide Maps & Options

63 NCBI FieldGuide Human ADAR Chimp ADAR Mouse ADAR

64 NCBI FieldGuide Genome Resources UniGene Map Viewer Genomic Biology Trace Archive E-PCR

65 NCBI FieldGuide Trace Archive Page

66 NCBI FieldGuide Ciona savignyi Traces

67 NCBI FieldGuide

68 Trace Archive BLAST Page Potential access to sequences NOT yet in GenBank

69 NCBI FieldGuide Basic Local Alignment Search Tool

70 NCBI FieldGuide BLAST Web Searches, ,000

71 NCBI FieldGuide Precomputed BLAST Services
  – Nucleotide or protein: Related Sequences
  – BLAST link: BLink
  – Transcript clusters: UniGene
  – Protein homologs: HomoloGene

72 NCBI FieldGuide Link to Related Sequences

73 NCBI FieldGuide Related Sequences Most similar Least similar

74 NCBI FieldGuide BLink (BLAST Link)

75 NCBI FieldGuide BLink Output Best hits 3D structures CDD-Search

76 NCBI FieldGuide Why Is BLAST So Popular?
  Fast: heuristic approach based on Smith-Waterman
  Local alignments
  Statistical significance: Expect value
  Versatile:
    – blastn, blastp, blastx, tblastn, tblastx, rps-blast, psi-blast
    – www, standalone, and network clients

77 NCBI FieldGuide Global vs Local Alignment Seq 1 Seq 2 Seq 1 Seq 2 Global alignment Local alignment

78 NCBI FieldGuide Global vs Local Alignment
  Seq1: WHEREISWALTERNOW (16aa)
  Seq2: HEWASHEREBUTNOWISHERE (21aa)
  Global
  Seq1:  1   W--HEREISWALTERNOW  16
             W  HERE
  Seq2:  1 HEWASHEREBUTNOWISHERE 21
  Local
  Seq1:  1 W--HERE  5
           W  HERE
  Seq2:  3 WASHERE  9
  Seq1:  1 W--HERE  5
           W  HERE
  Seq2: 15 WISHERE 21
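
To make the local-alignment idea concrete, here is a minimal Smith-Waterman scoring sketch. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration and are not taken from the slides; the function name is hypothetical. It returns only the best local score, not the alignment itself.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


seq1 = "WHEREISWALTERNOW"
seq2 = "HEWASHEREBUTNOWISHERE"
print(local_alignment_score(seq1, seq2))
```

The floor at zero is what separates local from global alignment: a poor region resets the score instead of dragging down a good match elsewhere.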

79 NCBI FieldGuide How BLAST Works
  1. Make lookup table of "words" for query
  2. Scan database for hits
  3. Extend alignment both directions
     – Ungapped extensions of hits (initial HSPs)
     – Gapped extensions (no traceback)
     – Gapped extensions (traceback - alignment details)

80 NCBI FieldGuide Protein Words
  Query: GTQITVEDLFYNIATRRKALKN
  Make a lookup table of words: GTQ TQI QIT ITV TVE VED EDL DLF ...
  Word size = 3 (default); word size can only be 2 or 3
  Neighborhood words: VTV, LTV, VSV, etc.
  Scores against the neighborhood score threshold: VTV 12, LTV 11, VSV 8
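
The word list on this slide can be reproduced with a few lines of code. This is a minimal sketch, assuming a plain dictionary as the lookup table (the function name is hypothetical). Neighborhood words are left out because generating them needs a full substitution matrix such as BLOSUM62.

```python
from collections import defaultdict


def word_lookup_table(query, word_size=3):
    """Map each overlapping word of the query to its 0-based start positions."""
    table = defaultdict(list)
    for start in range(len(query) - word_size + 1):
        table[query[start:start + word_size]].append(start)
    return dict(table)


query = "GTQITVEDLFYNIATRRKALKN"
table = word_lookup_table(query)
words = [query[i:i + 3] for i in range(len(query) - 2)]
print(words[:8])
print(table["ITV"])
```

A query of length L yields L - 2 overlapping words of size 3, so this 22-residue query gives 20 words.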

81 NCBI FieldGuide BLASTP Summary
  Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…
  Example query words: YLS, HFL
  Neighborhood words, neighborhood score threshold T (-f) = 11:
    YLS 15, YLT 12, YVS 12, YIT 10, etc. …
    HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc. …
  Ungapped extension around the word hits YLS and HFL:
    Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47
              +E  YA YL K    F+   L +SP+ +DVNVHP+K  V   +++ I
    Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333
  Drop-off score = highest score - current score
  -X: X dropoff value for gapped alignment (in bits): blastn 30, megablast 20, tblastx 0, all others 15
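
The drop-off rule (highest score minus current score) is what stops an extension. Below is a minimal sketch of a one-directional ungapped extension, assuming simple match/mismatch scoring (+1/-3) in raw score units rather than BLOSUM62 and bits; the function name and the toy sequences are hypothetical.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-3):
    """Extend an ungapped alignment to the right from (q_start, s_start).

    Returns (best_score, best_length): the highest running score seen and the
    number of aligned columns that achieve it.
    """
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            # Drop-off exceeded: give up and keep the best extension so far.
            break
    return best, best_len


print(extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, x_drop=5))
print(extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, x_drop=2))
```

With the larger drop-off the extension survives the single mismatch and reaches the end; with the smaller one it stops after the first four matches.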

82 NCBI FieldGuide BLASTP Summary
  Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…
  Example query words: YLS, HFL
  Neighborhood words, neighborhood score threshold T (-f) = 11:
    YLS 15, YLT 12, YVS 12, YIT 10, etc. …
    HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc. …
  High-scoring pair (HSP):
    Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47
              +E  YA YL K    F+   L +SP+ +DVNVHP+K  V   +++ I
    Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333
  Gapped extension with traceback, final HSP:
    Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50
              +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I + +
    Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337

83 NCBI FieldGuide Scoring Systems - Nucleotides
  Identity matrix [ -r 1 -q -3 ]
        A   G   C   T
    A  +1  -3  -3  -3
    G  -3  +1  -3  -3
    C  -3  -3  +1  -3
    T  -3  -3  -3  +1
  CAGGTAGCAAGCTTGCATGTCA
  || ||||||||||||  |||||    raw score = 19 - 9 = 10
  CACGTAGCAAGCTTG-GTGTCA
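
The raw score arithmetic on this slide can be checked directly. This is a minimal sketch, assuming every non-identical column (including the single gap column) costs -3, which is what the slide's 19 - 9 implies; a real BLAST search uses separate gap penalties. The function name is hypothetical.

```python
def raw_score(top, bottom, reward=1, penalty=-3):
    """Sum +reward for identical columns and penalty for every other column."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    return sum(
        reward if a == b and a != "-" else penalty
        for a, b in zip(top, bottom)
    )


top = "CAGGTAGCAAGCTTGCATGTCA"
bottom = "CACGTAGCAAGCTTG-GTGTCA"
print(raw_score(top, bottom))  # 19 identical columns, 3 others: 19 - 9 = 10
```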

84 NCBI FieldGuide Scoring Systems - Proteins
  Position Independent Matrices
    PAM Matrices (Percent Accepted Mutation)
      – Derived from observation; small dataset of alignments
      – Implicit model of evolution
      – All calculated from PAM1
      – PAM250 widely used
    BLOSUM Matrices (BLOck SUbstitution Matrices)
      – Derived from observation; large dataset of highly conserved blocks
      – Each matrix derived separately from blocks with a defined percent identity cutoff
      – BLOSUM62 - default matrix for BLAST
  Position Specific Score Matrices (PSSMs)
    PSI- and RPS-BLAST

85 NCBI FieldGuide BLOSUM62
  [Figure: triangular BLOSUM62 matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V X; first entries A/A = 4, R/A = -1, R/R = 5]
  Callout D F: negative for less likely substitutions
  Callout D Y F: positive for more likely substitutions

86 NCBI FieldGuide Position-Specific Score Matrix DAF-1 Serine/Threonine protein kinases catalytic loop 174 PSSM scores 5 4

87 NCBI FieldGuide A R N D C Q E G H I L K M F P S T W Y V 435 K E S N K P A M A H R D I K S K N I M V K N D L Position-Specific Score Matrix catalytic loop

88 NCBI FieldGuide Local Alignment Statistics (applies to ungapped alignments)
  Expect Value: E = number of database hits you expect to find by chance with score ≥ S
  $E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$
  $K$ = scale for search space
  $\lambda$ = scale for scoring system
  $S'$ = bit score $= (\lambda S - \ln K)/\ln 2$
  More info: The Statistics of Sequence Similarity Scores
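
The two forms of the Expect value, one using the raw score S with the parameters K and lambda and one using the bit score S', should give the same number. The sketch below checks that numerically. The parameter values (lambda = 0.318, K = 0.134) and the lengths m and n (taken here as query length and database length) are illustrative assumptions, not values from the slides.

```python
import math


def bit_score(raw, lam, k):
    """Convert a raw score S to a bit score S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


def expect_from_raw(raw, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Hypothetical numbers, for illustration only.
lam, k = 0.318, 0.134
m, n = 250, 1_000_000_000
raw = 100

bits = bit_score(raw, lam, k)
print(bits, expect_from_raw(raw, m, n, lam, k), expect_from_bits(bits, m, n))
```

Because the bit score absorbs K and lambda, bit scores from searches with different scoring systems can be compared directly, which raw scores cannot.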

89 NCBI FieldGuide An Alignment BLAST Cannot Make
    1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
      || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
    1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
   61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
      | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
   61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
  121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
      |||| || ||||| ||  ||    |  | ||||  || |||
  121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
  Reason: no contiguous exact match of 7 bp.
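
The stated reason can be checked against the alignment itself. This is a minimal sketch that measures the longest run of identical bases along the alignment shown on the slide (it does not test other diagonals); the function name is hypothetical and the sequences are copied from the slide.

```python
def longest_identical_run(a, b):
    """Length of the longest run of identical bases at the same positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


top = (
    "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
    "GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT"
    "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC"
)
bottom = (
    "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
    "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT"
    "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC"
)

print(longest_identical_run(top, bottom))
```

A longest run shorter than 7 means no 7-base word on this diagonal can seed the alignment, so a nucleotide search with a minimum word size of 7 has nothing to extend here.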

90 NCBI FieldGuide An Alignment BLAST Can Make
  Solution: compare protein sequences; BLASTX
  BLAST 2 Sequences (blastx) output:
    Score = 290 bits (741), Expect = 7e-77
    Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
    Frame = +3

91 NCBI FieldGuide Other BLAST Algorithms Megablast Discontiguous Megablast PSI-BLAST PHI-BLAST

92 NCBI FieldGuide Megablast: NCBI's Genome Annotator
  – Long alignments of similar DNA sequences
  – Greedy algorithm
  – Concatenation of query sequences
  – Faster than blastn; less sensitive

93 NCBI FieldGuide MegaBLAST & Word Size
  Trade-off: sensitivity vs speed
  WORD SIZE    minimum   default
  blastn           7        11
  megablast        8        28
  blastp           2         3

94 NCBI FieldGuide Discontiguous Megablast Uses discontiguous word matches Better for cross-species comparisons

95 NCBI FieldGuide Templates for Discontiguous Words
  W = 11, t = 16, coding:        W = 11, t = 16, non-coding:
  W = 12, t = 16, coding:        W = 12, t = 16, non-coding:
  W = 11, t = 18, coding:        W = 11, t = 18, non-coding:
  W = 12, t = 18, coding:        W = 12, t = 18, non-coding:
  W = 11, t = 21, coding:        W = 11, t = 21, non-coding:
  W = 12, t = 21, coding:        W = 12, t = 21, non-coding:
  W = word size (# matches in template); t = template length
  Reference: Ma, B., Tromp, J., Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics, March 2002; 18(3):440-5

96 NCBI FieldGuide

97 Discontiguous (Cross-species) MegaBLAST

98 NCBI FieldGuide Discontiguous Word Options

99 NCBI FieldGuide Disco. Megablast Example...
  Query: NM_ Drosophila melanogaster CG18582-PA (mbt) mRNA, (3244 bp) /note= mushroom bodies tiny; synonyms: Pak2, STE20, dPAK2
  Database: nr (nt), Mammalia[orgn]
  – MegaBLAST = "No significant similarity found."
  – Discontiguous megaBLAST = numerous hits...

100 NCBI FieldGuide Ex: Discontiguous MegaBLAST

101 NCBI FieldGuide Ex: BLASTN

102 NCBI FieldGuide PSI-BLAST Example: Confirming relationships of purine nucleotide metabolism proteins Position-specific Iterated BLAST

103 NCBI FieldGuide PSI-BLAST E value cutoff for PSSM
  >gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
  MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGF
  VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVD
  EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAY
  RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGA
  VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK

104 NCBI FieldGuide RESULTS: Initial BLASTP Same results as protein-protein BLAST; different format

105 NCBI FieldGuide Results of First PSSM Search Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

106 NCBI FieldGuide Tenth PSSM Search: Convergence Just below threshold, another nucleotide metabolism enzyme Check to add to PSSM

107 NCBI FieldGuide PHI-BLAST
  Pattern: [GA]xxxxGK[ST]
  >gi|231729|sp|P30429|CED4_CAEEL CELL DEATH PROTEIN 4
  MLCEIECRALSTAHTRLIHDFEPRDALTYLEGKNIFTEDHSELISKMSTRLERIANFLRIYRRQASE
  LIDFFNYNNQSHLADFLEDYIDFAINEPDLLRPVVIAPQFSRQMLDRKLLLGNVPKQMTCYIREYHV
  IKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDI
  LKSEDDLLNFPSVEHVTSVVLKRMICNALIDRPNTLFVFDDVVQEETIRWAQELRLRCLVTTRDVEI
  ASQTCEFIEVTSLEIDECYDFLEAYGMPMPVGEKEEDVLNKTIELSSGNPATLMMFFKSCEPKTFEK
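
To see where the pattern on this slide hits the query, it can be translated into a regular expression. This is a minimal sketch covering only the two features the slide's pattern uses (bracketed residue choices and lowercase `x` for any residue); the helper name is hypothetical and the sequence is one line excerpted from the slide.

```python
import re


def phi_pattern_to_regex(pattern):
    """Translate a simple PHI-BLAST-style pattern into a compiled regex.

    Handles only bracketed choices such as [GA] and lowercase 'x' wildcards.
    """
    return re.compile(pattern.replace("x", "."))


regex = phi_pattern_to_regex("[GA]xxxxGK[ST]")

# One line of the CED4 sequence shown on the slide.
excerpt = "IKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDI"
hit = regex.search(excerpt)
print(hit.group(), hit.start())
```

The hit marks the region that a pattern-seeded search would then extend around.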

108 NCBI FieldGuide What’s New?

109 NCBI FieldGuide BLAST Databases
  Nucleotide
    refseq_rna = NM_*, XM_*
    refseq_genomic = NC_*, NG_*
    env_nt: environmental sample[filter], e.g., 16S rRNA
  Protein
    refseq = NP_*, XP_*
    env_nr
    nr = nr

110 NCBI FieldGuide New Formatter Select lower case Select red

111 NCBI FieldGuide BLAST Output: Alignments & Filter low complexity sequence filtered

112 NCBI FieldGuide BLAST Output: CDS Feature

113 NCBI FieldGuide Advanced Options
  Limit to Organism: all[filter] NOT ma
  Example Entrez Queries
    all[Filter] NOT mammalia[Organism]
    ray finned fishes[Organism]
    srcdb refseq[Properties]
    Nucleotide only:
      biomol mrna[Properties]
      biomol genomic[Properties]
  Other Advanced
    -e 10000   expect value
    -v 2000    descriptions
    -b 2000    alignments
  -e v 2000

114 NCBI FieldGuide Genome BLAST

115 NCBI FieldGuide Genome BLAST via Map Viewer

116 NCBI FieldGuide Example: Human Genome BLAST TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAGTGAC CACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGAAC ATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGATGCCAAG GAGTTCGAACCTAAAGACGTATTGCCCAATGGGGATGGGAC CTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGGGGAAGAGC Human EST

117 NCBI FieldGuide Human Genome BLAST: Results

118 NCBI FieldGuide Human Genome BLAST: MapViewer Entrez Gene

119 NCBI FieldGuide Example: Mapping Oligos Onto a Genome >forward CCATGGCGACCCTGGAAAAGC >reverse CAGCAGCGGCTGTGCCTGCGG ? ? ?

120 NCBI FieldGuide Map Oligos Onto Genome
  >CCATGGCGACCCTGGAAAAGCNNNNNNNNNNCAGCAGCGGCTGTGCCTGCGG
  (forward primer, N spacer, reverse primer)
  -W 7 -e 1000
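
The combined query on this slide is the two primers joined by a run of N. Below is a minimal sketch of building it, assuming a spacer of ten Ns as the slide appears to show; the function name is hypothetical.

```python
def primer_pair_query(forward, reverse, spacer_len=10):
    """Join two primers with a run of N so both can be searched as one query."""
    return forward + "N" * spacer_len + reverse


forward = "CCATGGCGACCCTGGAAAAGC"
reverse = "CAGCAGCGGCTGTGCCTGCGG"
query = primer_pair_query(forward, reverse)
print(query)
```

Such short sequences only produce hits with the small word size and the relaxed expect threshold given on the slide (-W 7 -e 1000); each primer is then reported as its own alignment on either side of the spacer.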

121 NCBI FieldGuide Genome BLAST Results

122 NCBI FieldGuide Primer Alignments forward primer reverse primer

123 NCBI FieldGuide MapViewer

124 NCBI FieldGuide MapViewer

125 NCBI FieldGuide Sequence View (sv) forward reverse

126 NCBI FieldGuide Service Addresses BLAST General Help Wayne Matten

