Identifying templates for protein modeling:


1 Identifying templates for protein modeling:
Lecture 2 Identifying templates for protein modeling: Sequence alignment with BLAST and PSI-BLAST

2 Sources and additional information
Images and other material in this presentation are taken from Bioinformatics and Functional Genomics, third edition, by Jonathan Pevsner (John Wiley & Sons, Inc., 2015). The lecture closely follows chapter 4 of Pevsner's book, which discusses the issues covered here in depth. For additional material, please go to the book website:

3 Sequence alignment with BLAST
Webserver interface Aspects of the Algorithm Alignment strategies Detection of distant evolutionary relationships: PSI-BLAST

4 BLAST
BLAST (Basic Local Alignment Search Tool): scans large databases of sequences

5 Typical use
Identifying orthologs and paralogs. Discovering variant proteins. Exploring structure-function relations.

6 BLAST requires four choices
Choose the query sequence Select the BLAST program Choose a database Select optional parameters

7 Web interface

8 How to get FASTA format for the query sequence

9 Five distinct BLAST programs
blastn (nucleotide BLAST) blastp (protein BLAST) blastx (translated BLAST) tblastn (translated BLAST) tblastx (translated BLAST)
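The three "translated BLAST" programs differ in which side of the comparison is translated. The lookup below is a small sketch of that distinction; the helper name `programs_for` is hypothetical and not part of any BLAST package.

```python
# Molecule types compared by each BLAST program: (query, database).
# "translated nucleotide" means a nucleotide sequence that is translated
# in six reading frames and then compared at the protein level.
BLAST_PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide"),
    "blastp": ("protein", "protein"),
    "blastx": ("translated nucleotide", "protein"),
    "tblastn": ("protein", "translated nucleotide"),
    "tblastx": ("translated nucleotide", "translated nucleotide"),
}


def programs_for(query_type, db_type):
    """Return programs that accept the given raw sequence types.

    Hypothetical helper: query_type and db_type are "protein" or "nucleotide".
    """
    def raw(kind):
        return kind.replace("translated ", "")

    return sorted(
        name
        for name, (query, database) in BLAST_PROGRAMS.items()
        if raw(query) == query_type and raw(database) == db_type
    )


print(programs_for("nucleotide", "protein"))
print(programs_for("protein", "nucleotide"))
```

A nucleotide query against a nucleotide database matches two programs (blastn and tblastx); which one fits depends on whether the comparison should happen at the DNA or the protein level.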

10 Some optional search parameters
organism algorithm

11 Why low complexity filter?
(a) Query: human insulin NP_000198 Program: blastp Database: C. elegans RefSeq Default settings: Unfiltered (“composition-based statistics”)

12 Why low complexity filter?
(d) Query: human insulin NP_000198 Program: blastp Database: C. elegans RefSeq Option: Filter low complexity regions Note the different bit score.

13 BLAST search output

14 BLAST search output

15 BLAST search output

16 Sequence alignment with BLAST
Webserver interface Aspects of the Algorithm Alignment strategies Detection of distant evolutionary relationships: PSI-BLAST

17 BLAST: what kind of alignment?
Global alignment (Needleman & Wunsch, 1970): Uses dynamic programming. Gaps are inserted so that the full lengths of both sequences are aligned (“global”).

18 BLAST: what kind of alignment?
Local alignment (Smith & Waterman, 1981): Just a portion of either sequence is aligned. Useful to find matching domains in two sequences. BLAST finds a local alignment through a heuristic approach.
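The difference between global and local alignment can be sketched with two short dynamic-programming score functions. This is a minimal illustration, assuming a simple match/mismatch score and a linear gap penalty (the values +2, -1, -2 are arbitrary choices, not BLAST defaults); it returns only the best score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment score: best-scoring pair of subsequences."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 lets an alignment restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Toy sequences sharing only the core "ACG".
print(needleman_wunsch_score("TTACGTTT", "GGACGGG"))
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The local score rewards only the shared core, while the global score is dragged down by the unrelated flanks, which is why local alignment suits domain detection.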

19 How BLAST works: three phases
Phase 1: compile a list of word pairs (w=3) above threshold T. Example: for a human RBP query …FSGTWYA… (the query word is highlighted in yellow on the slide), a list of words (w=3) is: FSG SGT GTW TWY WYA YSG TGT ATW SWY WFA FTG SVT GSW TWF WYS

20 Phase 1: compile a list of words (w=3) and score them according to BLOSUM matrices
Neighborhood word hits above threshold (T=11):
  GTW  6,5,11  22
  GSW  6,1,11  18
  ATW  0,5,11  16
  NTW  0,5,11  16
  GTY  6,5,2   13
Neighborhood words below threshold:
  GNW  10
  GAW  9
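Phase 1 can be sketched in a few lines: split the query into overlapping words, score candidate words against a query word position by position, and keep those that reach the threshold. The pair scores below are copied from the slide rather than loaded from a full BLOSUM matrix, and treating a score equal to T as a hit is an assumption.

```python
# Per-position scores as shown on the slide: (query residue, candidate residue).
PAIR_SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}

T = 11  # threshold from the slide


def words(sequence, w=3):
    """All overlapping words of length w in the sequence."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def word_score(query_word, candidate):
    """Sum of the pairwise scores at each of the w positions."""
    return sum(PAIR_SCORES[(q, c)] for q, c in zip(query_word, candidate))


print(words("FSGTWYA"))

query_word = "GTW"
for candidate in ["GTW", "GSW", "ATW", "NTW", "GTY"]:
    score = word_score(query_word, candidate)
    status = "hit" if score >= T else "rejected"
    print(candidate, score, status)
```

A real implementation would enumerate all 20^3 possible words against a complete scoring matrix; the restricted dictionary here only covers the candidates listed on the slide.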

21 BLAST second phase Phase 2:
Scan the database to find matches for the compiled list.

22 BLAST third phase
Phase 3: extend the hit in either direction (with Smith-Waterman and a scoring matrix). Stop when the score drops below some cutoff.
KENFDKARFSGTWYAMAKKDPEG 50 query
MKGLDIQKVAGTWYSLAMAASD. 44 hit
[Figure labels: “Hit!” at the matching word, with “extend” arrows pointing in both directions]
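The stopping rule in phase 3 can be illustrated with a simplified ungapped extension. Note that this is a swapped-in simplification, not the Smith-Waterman procedure named on the slide: it walks outward from the seed without gaps and stops once the running score falls more than `x_drop` below the best score seen. The function name, the match/mismatch scores (+5/-4) and the drop-off value are all assumptions for illustration.

```python
def extend_hit(query, subject, q_start, s_start, length,
               match=5, mismatch=-4, x_drop=10):
    """Extend an ungapped seed in both directions.

    Returns (score, query_start, query_end), with query_end exclusive.
    """
    def pair(a, b):
        return match if a == b else mismatch

    seed = sum(pair(query[q_start + k], subject[s_start + k])
               for k in range(length))

    def walk(qi, si, step):
        best = running = 0
        best_len = steps = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += pair(query[qi], subject[si])
            steps += 1
            if running > best:
                best, best_len = running, steps
            if best - running > x_drop:
                break  # score dropped too far below the best: stop
            qi += step
            si += step
        return best, best_len

    right_gain, right_len = walk(q_start + length, s_start + length, +1)
    left_gain, left_len = walk(q_start - 1, s_start - 1, -1)
    return (seed + left_gain + right_gain,
            q_start - left_len,
            q_start + length + right_len)


# Sequences from the slide (trailing "." of the hit omitted); seed is GTWY.
query = "KENFDKARFSGTWYAMAKKDPEG"
hit = "MKGLDIQKVAGTWYSLAMAASD"
print(extend_hit(query, hit, query.index("GTWY"), hit.index("GTWY"), 4))
```

With plain identity scoring the flanking residues of the slide example do not extend the seed; a substitution matrix such as BLOSUM62, which gives credit to conservative replacements, would behave differently.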

23 How to interpret a BLAST search: expect value
It is important to assess the statistical significance of search results. For local alignments (including BLAST search results), the statistics are well understood. The scores follow an extreme value distribution (EVD) rather than a normal distribution.

24 $E = K m n \, e^{-\lambda S}$
E value from the extreme value distribution (the number of high-scoring segment pairs expected to occur with a score of at least S). S = the score; m, n = the lengths of the two sequences; λ, K = Karlin-Altschul parameters (empirical).
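The formula is easy to evaluate directly. The sketch below also computes the bit score, S' = (λS − ln K) / ln 2, which normalizes the raw score so that E = m n 2^(−S'). The parameter values and the sequence lengths are illustrative assumptions only; real values of K and λ depend on the scoring matrix and gap penalties in use.

```python
import math


def expect_value(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


# Assumed, illustrative parameters (not taken from the lecture).
K, LAM = 0.041, 0.267
m, n = 110, 10_000_000  # hypothetical query length and database length

for raw in (40, 60, 80):
    print(raw, round(bit_score(raw, K, LAM), 1), expect_value(raw, m, n, K, LAM))
```

Because E scales linearly with m and n, the same raw score becomes less significant as the database grows, while it decreases exponentially with the score.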

25 How to interpret BLAST: E values and p values
Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than the corresponding p values. [Table of E values and corresponding p values; the surviving entries give p of about 0.1, about 0.05, and about 0.001]
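The relation between the two quantities is p = 1 − e^(−E): the probability of finding at least one chance hit when E are expected. A short sketch (the function name is hypothetical) shows why small E values are nearly equal to p, while E values between 1 and 10 map to p values crowded just below 1.

```python
import math


def p_from_e(e_value):
    """Probability of at least one chance hit, p = 1 - exp(-E)."""
    return -math.expm1(-e_value)  # numerically stable for tiny E


for e_value in (10, 5, 2, 1, 0.1, 0.05, 0.001):
    print(e_value, p_from_e(e_value))
```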

26 Sequence alignment with BLAST
Webserver interface Aspects of the Algorithm Alignment strategies Detection of distant evolutionary relationships: PSI-BLAST

27 A real match might have E value > 1
Where do we stop? Running BLAST with a putative hit as the query might help to establish a threshold.

28 Sometimes a similar E value occurs for a short exact match and a long less exact match
[Figure labels: one hit is short and nearly exact; the other is long with only 31% identity; both have a similar E value]

29 Sequence alignment with BLAST
Webserver interface Aspects of the Algorithm Alignment strategies Detection of distant evolutionary relationships: PSI-BLAST

30 PSI-BLAST is performed in five steps
[1] Scan the protein database with a query [2] PSI-BLAST uses the hits to generate a multiple sequence alignment. The latter is used to initialize a position-specific scoring matrix (PSSM)

31 Inspect the blastp output to identify empirical “rules” regarding amino acids tolerated at each position
[Figure labels: sets of residues tolerated at individual positions: R,I,K; C; D,E,T; K,R,T; N,L,Y,G]

32 [PSSM figure: the columns are the 20 amino acids (A R N D C Q E G H I L K M F P S T W Y V); the rows are all the amino acids from position 1 to the end of your PSI-BLAST query protein (..., 37 S, 38 G, 39 T, 40 W, 41 Y, 42 A)]

33 [Same PSSM figure]

34 [Same PSSM figure]
Note that a given amino acid (such as alanine) in your query protein can receive different scores for matching alanine, depending on the position in the protein.

35 [Same PSSM figure]
Note that a given amino acid (such as tryptophan) in your query protein can receive different scores for matching tryptophan, depending on the position in the protein.
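The position dependence noted above follows from how a PSSM is built: each row is derived from the residues observed in one alignment column. The sketch below is a deliberately simplified version, assuming a uniform background frequency and a fixed pseudocount; PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts. The toy alignment and both function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, pseudocount=1.0):
    """One row of log-odds scores (in bits) per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(row)
    return pssm


def score_segment(pssm, segment):
    """Score a database segment against the matrix, position by position."""
    return sum(row[aa] for row, aa in zip(pssm, segment))


# Hypothetical gap-free alignment of four hits, four columns wide.
toy_alignment = ["GTWA", "GSWA", "ATWS", "GTWA"]
pssm = build_pssm(toy_alignment)

# Alanine is rare in column 0 but common in column 3, so its score differs.
print(round(pssm[0]["A"], 2), round(pssm[3]["A"], 2))
print(round(score_segment(pssm, "GTWA"), 2))
```

Scoring a database segment then means looking up each residue in the row for its position, instead of using one substitution matrix for the whole protein.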

36 PSI-BLAST is performed in five steps
[1] Scan the protein database with a query [2] PSI-BLAST uses the hits to generate a multiple sequence alignment. The latter is used to initialize a position-specific scoring matrix (PSSM) [3] The PSSM is used to score the alignments of the query with the database [4] Statistical significance (E values) is re-estimated on the basis of the new raw scores (from the PSSM)

37 Note the new entries: some hits became statistically significant with the PSSM

38 PSI-BLAST is performed in five steps
[1] Scan the protein database with a query [2] PSI-BLAST uses the hits to generate a multiple sequence alignment. The latter is used to initialize a position-specific scoring matrix (PSSM) [3] The PSSM is used to score the alignments of the query with the database [4] Statistical significance (E values) is re-estimated on the basis of the new raw scores (from the PSSM) [5] Iterate through [3] and [4] until convergence (in principle; in practice, two or three iterations)

39 “Rate of Convergence” of PSI-BLAST searches
[Plot: number of hits and number of hits above threshold, as a function of iteration number]

