Advanced BLAST Searching


1 Advanced BLAST Searching
Part 2 of 2 September 17, 2003

2 Copyright notice
Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN ). Copyright © 2003 by John Wiley & Sons, Inc. These images and materials may not be used without permission from the publisher. We welcome instructors to use these PowerPoints for educational purposes, but please acknowledge the source. The book has a homepage at Including hyperlinks to the book chapters.

3 PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 2
Score = 140 bits (353), Expect = 1e-32
Identities = 45/176 (25%), Positives = 78/176 (43%), Gaps = 33/176 (18%)
Query: 4 VWALLLLAAWAAAERDCRVSSF RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55
V L+ LA A F V+ENFD ++ G WY + +K P
Sbjct: 2 VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60
Query: 56 NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112
I A +S+ E G + K D + V PAK
Sbjct: 61 CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL
Query: 113 LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164
+WI+ TDY+ YA+ YSC L ++D R+P LPPE
Sbjct: MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159
Page 142

4 PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 3
Score = 159 bits (404), Expect = 1e-38
Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)
Query: 3 WVWALLLLAAWAAAERD CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54
V L+ LA A S V+ENFD ++ G WY + K
Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59
Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114
+ I A +S+ E G + K V PAK
Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL
Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164
+WI+ TDY+ YA+ YSC R+P LPPE
Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159
Page 142

5 Iteration 1 and iteration 3 (Page 142)
Iteration 1:
Score = 46.2 bits (108), Expect = 2e-04
Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)
Query: 27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86
V+ENFD ++ G WY + +K P I A +S+ E G + K
Sbjct: 33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK ELS 82
Query: 87 ADMVGTF TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137
D GT PAK WI+ TDY+ YA+ YSC
Sbjct: 83 PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135
Query: LLNLDGTCADSYSFVFSRDPNGLPPE 163
L ++D R+P LPPE
Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158
Iteration 3:
Score = 159 bits (404), Expect = 1e-38
Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)
Query: 3 WVWALLLLAAWAAAERD CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54
V L+ LA A S V+ENFD ++ G WY + K
Sbjct: 1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59
Query: 55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114
+ I A +S+ E G + K V PAK
Sbjct: 60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL
Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164
+WI+ TDY+ YA+ YSC R+P LPPE
Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159
Page 142
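The summary lines of the three iterations can also be compared programmatically. The following is a minimal Python sketch, not part of the original slides: the helper name `parse_summary` is hypothetical, and the regular expressions assume the header layout shown in the alignments above.

```python
import re

# Summary headers copied from the slides for iterations 1, 2, and 3.
HEADERS = {
    1: "Score = 46.2 bits (108), Expect = 2e-04 Identities = 40/150 (26%), "
       "Positives = 70/150 (46%), Gaps = 37/150 (24%)",
    2: "Score = 140 bits (353), Expect = 1e-32 Identities = 45/176 (25%), "
       "Positives = 78/176 (43%), Gaps = 33/176 (18%)",
    3: "Score = 159 bits (404), Expect = 1e-38 Identities = 41/170 (24%), "
       "Positives = 69/170 (40%), Gaps = 19/170 (11%)",
}

def parse_summary(header):
    """Extract bit score, E value, and identity counts (hypothetical helper)."""
    bits = float(re.search(r"Score = ([\d.]+) bits", header).group(1))
    expect = float(re.search(r"Expect = (\S+)", header).group(1))
    ident, length = map(
        int, re.search(r"Identities = (\d+)/(\d+)", header).groups()
    )
    return {"bits": bits, "expect": expect,
            "identities": ident, "length": length}

summaries = {i: parse_summary(h) for i, h in HEADERS.items()}
for i, s in sorted(summaries.items()):
    print(f"iteration {i}: {s['bits']} bits, E = {s['expect']:g}, "
          f"{s['identities']}/{s['length']} identical")
```

Read side by side, the bit score rises and the E value falls from iteration to iteration, while the fraction of identical residues stays near 25%.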

6 The universe of lipocalins (each dot is a protein)
[Figure labels: retinol-binding protein, odorant-binding protein, apolipoprotein D] Page 143

7 Scoring matrices let you focus on the big (or small) picture
[Figure labels: retinol-binding protein, your RBP query] Page 143

8 Scoring matrices let you focus on the big (or small) picture
[Figure labels: PAM250, PAM30, Blosum80, Blosum45; retinol-binding protein (twice)] Page 143

9 PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM
[Figure labels: retinol-binding protein (twice)] Page 143

10 PSI-BLAST: performance assessment
Evaluate PSI-BLAST results using a database in which protein structures have been solved and all proteins in a group share < 40% amino acid identity. Page 143

11 PSI-BLAST: the problem of corruption
PSI-BLAST is useful to detect weak but biologically meaningful relationships between proteins. The main source of false positives is the spurious amplification of sequences not related to the query. For instance, a query with a coiled-coil motif may detect thousands of other proteins with this motif that are not homologous. Once even a single spurious protein is included in a PSI-BLAST search above threshold, it will not go away. Page 144

12 PSI-BLAST: the problem of corruption
Corruption is defined as the presence of at least one false positive alignment with an E value < 10^-4 after five iterations.
Three approaches to stopping corruption:
[1] Apply filtering of biased composition regions.
[2] Adjust E value from (default) to a lower value such as E = 
[3] Visually inspect the output from each iteration. Remove suspicious hits by unchecking the box.
Page 144
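The definition of corruption can be written down as a small check. This is a minimal Python sketch under stated assumptions: the `Hit` record, the function names, and the example hits are all hypothetical, and the true/false label of each hit is assumed to come from a benchmark database of solved structures rather than from BLAST itself.

```python
from dataclasses import dataclass

CORRUPTION_E_VALUE = 1e-4  # threshold from the definition on the slide
FINAL_ITERATION = 5        # the definition is evaluated after five iterations

@dataclass
class Hit:
    name: str
    e_value: float
    is_true_homolog: bool  # assumed known from a benchmark, not from BLAST

def is_corrupted(hits_by_iteration):
    """True if iteration five holds a false positive with E < 1e-4.

    hits_by_iteration maps an iteration number to a list of Hit records.
    """
    final_hits = hits_by_iteration.get(FINAL_ITERATION, [])
    return any(
        h.e_value < CORRUPTION_E_VALUE and not h.is_true_homolog
        for h in final_hits
    )

def apply_inclusion_threshold(hits, threshold):
    """Approach [2]: keep only hits at or below a stricter E value."""
    return [h for h in hits if h.e_value <= threshold]

# Hypothetical example data, invented for illustration only.
clean = {5: [Hit("lipocalin_A", 1e-30, True),
             Hit("coiled_coil_X", 0.5, False)]}
corrupted = {5: [Hit("lipocalin_A", 1e-30, True),
                 Hit("coiled_coil_X", 1e-6, False)]}

print(is_corrupted(clean))      # the unrelated hit stays above 1e-4
print(is_corrupted(corrupted))  # the unrelated hit has dropped below 1e-4
```

The stricter threshold passed to `apply_inclusion_threshold` is a caller-supplied value; the sketch does not assume any particular default.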

13

14 Page 152

15 Page 152

16 PHI-BLAST: Pattern hit initiated BLAST
Launches from the same page as PSI-BLAST.
Combines matching of regular expressions with local alignments surrounding the match.
Page 145

17 PHI-BLAST: Pattern hit initiated BLAST
Launches from the same page as PSI-BLAST.
Combines matching of regular expressions with local alignments surrounding the match.
Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question: What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences?
PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology.
Page 145
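The idea of requiring both a pattern occurrence and similarity around it can be illustrated with a toy example. This is a rough Python sketch, not PHI-BLAST's actual algorithm: it replaces the local alignment with an ungapped identity count in a fixed window, and the function names, the window size, the 30% cutoff, and the `chance_hit` sequence are all assumptions made for illustration. The two real sequences are the first 50 alignment columns of hsrbp and ecblc from the lipocalin slides.

```python
import re

def flank_identity(query, q_pos, subject, s_pos, flank):
    """Fraction of identical residues in ungapped windows of +/- flank
    residues around the two pattern start positions."""
    matches = compared = 0
    for offset in range(-flank, flank + 1):
        qi, si = q_pos + offset, s_pos + offset
        if 0 <= qi < len(query) and 0 <= si < len(subject):
            compared += 1
            matches += query[qi] == subject[si]
    return matches / compared if compared else 0.0

def pattern_seeded_hits(query, database, pattern, flank=10, min_identity=0.3):
    """Report sequences that contain the pattern AND resemble the query
    around it. Both thresholds are hypothetical."""
    q_match = re.search(pattern, query)
    if q_match is None:
        raise ValueError("pattern must occur in the query")
    hits = []
    for name, seq in database.items():
        for m in re.finditer(pattern, seq):
            ident = flank_identity(query, q_match.start(), seq, m.start(), flank)
            if ident >= min_identity:
                hits.append((name, m.start(), round(ident, 2)))
    return hits

# Query: start of human RBP (hsrbp) as shown in the lecture alignment.
QUERY = "MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKD"
DATABASE = {
    # Start of the E. coli lipocalin (ecblc) from the same alignment.
    "ecblc": "MRLLPLVAAATAAFLVVACSSPTPPRGVTVVNNFDAKRYLGTWYEIARFD",
    # Hypothetical unrelated sequence that contains the pattern by chance.
    "chance_hit": "PPPPPPPPPPPPPPPGSWYEIPPPPPPPPPPPPPP",
}

hits = pattern_seeded_hits(QUERY, DATABASE, r"G.W[YF][EA][IVLM]")
print(hits)
```

Both database sequences contain the pattern, but only ecblc also resembles the query in the surrounding residues, so the chance occurrence is filtered out.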

18 Align three lipocalins (RBP and two bacterial lipocalins)
ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc    MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD
hsrbp ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD
Page 145

19 Pick a small, conserved region and see which amino acid residues are used
ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc    MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD
hsrbp ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD
                                                  GTWYEI
                                                   K  AV
                                                       M
Page 145

20 Create a pattern using the appropriate syntax
ecblc MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc    MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD
hsrbp ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD
                                                  GTWYEI
                                                   K  AV
                                                       M
                                                  GXW[YF][EA][IVLM]
Page 145
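The step of reading residues off a conserved region can be sketched in Python. This is an illustration, not a tool from the lecture: `observed_pattern` is a hypothetical helper, the column range 40 to 46 is my reading of where the GTWYEI block sits once spaces are removed, and the helper assumes the chosen region contains no gap characters.

```python
# The three aligned sequences as shown on the slides (spaces separate
# blocks of ten columns; "." and "~" are gap/padding characters).
ALIGNMENT = {
    "ecblc": "MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD",
    "vc":    "MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD",
    "hsrbp": "~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD",
}

def observed_pattern(alignment, start, end):
    """Build a pattern from the residues seen in columns start..end-1.

    A column with one residue becomes a literal; a column with several
    becomes a bracketed class, e.g. [KT].
    """
    rows = [row.replace(" ", "") for row in alignment.values()]
    parts = []
    for col in range(start, end):
        residues = sorted({row[col] for row in rows})
        if len(residues) == 1:
            parts.append(residues[0])
        else:
            parts.append("[" + "".join(residues) + "]")
    return "".join(parts)

strict = observed_pattern(ALIGNMENT, 40, 46)
print(strict)
```

The result lists only residues actually observed in these three sequences. The pattern on the slide, GXW[YF][EA][IVLM], is deliberately more permissive: it allows any residue at the second position and adds F and L, which do not occur in this alignment.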

21 Page 146

22 Page 147

23 Syntax rules for PHI-BLAST
The syntax for patterns in PHI-BLAST follows the conventions of PROSITE (protein lecture, Chapter 8). When using the stand-alone program, it is permissible to have multiple patterns. When using the Web page, only one pattern is allowed per query.
[ ] means any one of the characters enclosed in the brackets, e.g., [LFYT] means one occurrence of L or F or Y or T
- means nothing (it is a spacer character)
x(5) means 5 positions in which any residue is allowed
x(2,4) means 2 to 4 positions where any residue is allowed
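The rules above map closely onto ordinary regular expressions. The following is a minimal Python sketch of that mapping, covering only the rules listed on this slide. The function name `phi_pattern_to_regex` is hypothetical, and treating an uppercase X like lowercase x is an assumption based on the GXW[YF][EA][IVLM] pattern used earlier in the lecture.

```python
import re

def phi_pattern_to_regex(pattern):
    """Translate the subset of pattern syntax listed above into a
    Python regular expression string."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "-":
            # Spacer character: contributes nothing to the regex.
            i += 1
        elif ch == "[":
            # Bracketed class: same meaning in regex, copy it through.
            close = pattern.index("]", i)
            out.append(pattern[i:close + 1])
            i = close + 1
        elif ch in "xX":
            # Any residue, optionally followed by (n) or (n,m).
            out.append(".")
            i += 1
            m = re.match(r"\((\d+)(?:,(\d+))?\)", pattern[i:])
            if m:
                low, high = m.group(1), m.group(2)
                out.append("{" + low + ("," + high if high else "") + "}")
                i += m.end()
        else:
            # Literal residue letter.
            out.append(ch)
            i += 1
    return "".join(out)

lecture_regex = phi_pattern_to_regex("GXW[YF][EA][IVLM]")
print(lecture_regex)
print(phi_pattern_to_regex("[LFYT]-x(2,4)-G"))
print(bool(re.search(lecture_regex, "KENFDKARFSGTWYAMAKKD")))
```

The last line checks the lecture pattern against the end of the hsrbp sequence from the alignment slides. Anything outside the listed rules (for example repeat counts on literal residues) is not handled by this sketch.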

24 BLAST for gene discovery
You can use BLAST to find a “novel” gene Page 147

25 BLAST for gene discovery
You can use BLAST to find a “novel” gene Note to students taking this class for credit: You will need to do this for 40% of your grade. In the first three years of this course, everyone has succeeded at this exercise. Page 147

26 Start with the sequence of a known protein Page 148

27 Start with the sequence of a known protein
tblastn: Search a DNA database (e.g. HTGS, dbEST, or genomic sequence from a specific organism)
Page 148

28 Start with the sequence of a known protein
tblastn: Search a DNA database (e.g. HTGS, dbEST, or genomic sequence from a specific organism)
inspect: Find matches…
[1] to DNA encoding known proteins
[2] to DNA encoding related (novel!) proteins
[3] to false positives
Page 148

29 Start with the sequence of a known protein
tblastn: Search a DNA database (e.g. HTGS, dbEST, or genomic sequence from a specific organism)
inspect: Find matches…
[1] to DNA encoding known proteins
[2] to DNA encoding related (novel!) proteins
[3] to false positives
blastx or blastp nr: Search your DNA or protein against a protein database (nr) to confirm you have identified a novel gene
Page 148
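The "inspect" step sorts tblastn matches into the three categories listed above. The lecture does this by eye in the web interface; the following Python sketch shows one way the same triage could be scripted, assuming the hits are available as tab-separated rows in the 12-column tabular layout of command-line BLAST+ (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name, the E value cutoff, the set of already-annotated subjects, and every row of example data are hypothetical.

```python
def triage_tblastn_hits(tabular_text, known_subjects, max_evalue=1e-5):
    """Sort tabular tblastn hits into the three categories from the slide.

    known_subjects: IDs of DNA entries already annotated as encoding
    known proteins (assumed to be supplied by the user).
    max_evalue: hypothetical cutoff above which a hit is treated as a
    likely false positive.
    """
    categories = {
        "known": [],
        "candidate_novel": [],
        "likely_false_positive": [],
    }
    for line in tabular_text.strip().splitlines():
        fields = line.split("\t")
        subject, evalue = fields[1], float(fields[10])
        if evalue > max_evalue:
            categories["likely_false_positive"].append(subject)
        elif subject in known_subjects:
            categories["known"].append(subject)
        else:
            categories["candidate_novel"].append(subject)
    return categories

# Hypothetical hits, invented for illustration only.
ROWS = [
    ["rbp_query", "est_0001", "98.0", "150", "3", "0",
     "1", "150", "10", "459", "1e-80", "290"],
    ["rbp_query", "est_0002", "31.0", "140", "90", "4",
     "20", "159", "5", "412", "2e-09", "55.1"],
    ["rbp_query", "est_0003", "27.0", "40", "29", "0",
     "60", "99", "100", "219", "3.1", "22.3"],
]
TABULAR = "\n".join("\t".join(row) for row in ROWS)

result = triage_tblastn_hits(TABULAR, known_subjects={"est_0001"})
for category, subjects in result.items():
    print(category, subjects)
```

A hit that lands in `candidate_novel` is only a candidate: as the flow above says, it still has to be searched against nr with blastx or blastp to confirm that it is a novel gene.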

30 Page 148

31 Page 148

32 (Page 150)

33 this is a good candidate for a novel gene/protein

34 A blastp nr search confirms that the Salmonella query is closely related to other lipocalins (Page 150)

35 BLAST for gene discovery
You can use BLAST to find a “novel” gene Note to students taking this class for credit: You will need to do this for 40% of your grade. Ideally, try to find a new gene this week. You can discuss it anytime with me or Mayra, Hugh and Gek. You should have your novel protein by October 13 (for the first phylogeny lecture) so you can put your novel protein into a tree. I will provide sample projects from last year.

