LSM3241: Bioinformatics and Biocomputing Lecture 4: Sequence analysis methods revisited Prof. Chen Yu Zong Tel: 6874-6877 Email: csccyz@nus.edu.sg http://xin.cz3.nus.edu.sg.


1 LSM3241: Bioinformatics and Biocomputing
Lecture 4: Sequence analysis methods revisited
Prof. Chen Yu Zong
Tel: 6874-6877
Room 07-24, level 7, SOC1, National University of Singapore

2 Sequence Analysis Methods

3 Gene and Protein Sequence Alignment as a Mathematical Problem:
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best alignment: ATTCTTGC placed as one block against ATCCTATTCTAGC, with a single gap marked.
Bad alignment: sequence a split into AT, TCTT and GC against ATCCTATTCTAGC, with two gaps marked.
What is a good alignment?

4 How to rate an alignment?
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
(Figure: sequence a1 a2 a3 … x … aligned above sequence b1 b2 b3 … y ….)
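The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `w` and `alignment_score` are hypothetical, and the example rows are the lecture's two sequences (CTTAACT and CGGATCAT) aligned with one gap.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score one alignment column with the slide's values."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def alignment_score(row_a, row_b):
    """Sum the column scores of two gapped rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


# 8 - 5 - 5 + 8 - 5 + 8 - 3 + 8
print(alignment_score("CTTAAC-T", "CGGATCAT"))  # 14
```

Keeping the three values as keyword parameters makes it easy to try other scoring schemes on the same alignment.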

5 Pairwise Alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
An alignment of a and b:
C---TTAACT
CGGATCA--T
Labels in the figure: insertion gap, match, mismatch, deletion gap.

6 Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
(Figure: grid graph with sequence b, C G G A T C A T, on one axis and sequence a, C T T A A C T, on the other; the insertion gap and the deletion gap of the alignment are marked.)

7 Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
(Figure: alignment graph built up step by step; labels drawn so far: b = C, a = C.)

8 Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
(Figure: labels drawn so far: b = C G G A, a = C.)

9 Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
(Figure: labels drawn so far: b = C G G A T, a = C T.)

10 Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
(Figure: labels drawn so far: b = C G G A T C A, a = C T T A A C.)

11 Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
(Figure: complete graph: b = C G G A T C A T, a = C T T A A C T.)

12 Pathway of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
(Figure: the path of this alignment through the complete graph.)

13 Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
CTTAACT-
CGGATCAT
(Figure: alignment graph with b = C G G A T C A T and a = C T T A A C T.)

14 Pathway of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
CTTAACT-
CGGATCAT
(Figure: the path of this alignment through the graph.)

15 Use of graph to generate alignments
Sequence a: CTTAACT
Sequence b: CGGATCAT
-CTTAACT
CGGATCAT
(Figure: the path of this alignment through the graph.)

16 Use of graph to generate alignments
Sequence a: CTTAACT
Sequence b: CGGATCAT
-C--TTAACT
CGGATC-AT-
(Figure: the path of this alignment through the graph.)

17 Use of graph to generate alignments
Sequence a: CTTAACT
Sequence b: CGGATCAT
CTTAACT - - CGGATCAT
(Figure: the path of this alignment through the graph.)
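The correspondence between a path through the alignment graph and an alignment can be sketched as follows. This is an illustrative sketch only: the move letters `D` (diagonal, pair a residue of a with a residue of b), `H` (consume a residue of b against a gap) and `V` (consume a residue of a against a gap) are naming assumptions, not notation from the slides.

```python
def path_to_alignment(a, b, moves):
    """Turn a path through the alignment graph into two gapped rows."""
    i = j = 0
    top, bottom = [], []
    for move in moves:
        if move == "D":      # diagonal edge: a[i] paired with b[j]
            top.append(a[i])
            bottom.append(b[j])
            i += 1
            j += 1
        elif move == "H":    # b[j] paired with a gap in a
            top.append("-")
            bottom.append(b[j])
            j += 1
        elif move == "V":    # a[i] paired with a gap in b
            top.append(a[i])
            bottom.append("-")
            i += 1
        else:
            raise ValueError(f"unknown move {move!r}")
    if i != len(a) or j != len(b):
        raise ValueError("path does not end at the final corner of the graph")
    return "".join(top), "".join(bottom)


print(path_to_alignment("CTTAACT", "CGGATCAT", "DHHHDDDVVD"))
# ('C---TTAACT', 'CGGATCA--T')
```

Every path from the start corner to the end corner yields one alignment, which is why choosing the best alignment amounts to choosing the best path.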

18 Which pathway is better?
Sequence a: CTTAACT
Sequence b: CGGATCAT
Multiple pathways, each with its own score.
(Figure: alignment graph with b = C G G A T C A T and a = C T T A A C T, showing several paths.)

19 Alignment Score
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
Running score along the path: 8

20 Alignment Score
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
Running score along the path: 8, 8-3 = 5

21 Alignment Score
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
Running score along the path: 8, 8-3 = 5, 5-3 = 2, 2-3 = -1

22 Alignment Score
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
Running score along the path: 8, 5, 2, -1, -1+8 = 7, 7-5 = 2, 2+8 = 10, 10-3 = 7, 7-3 = 4
Alignment score: 4+8 = 12
(The sixth column pairs T with C, a mismatch, so it contributes -5 rather than -3.)

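23 An optimal alignment -- the alignment of maximum score
Let A = a1a2…am and B = b1b2…bn.
Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj.
With proper initializations, Si,j can be computed as follows:
Si,j = max { Si-1,j-1 + w(ai, bj), Si-1,j + w(ai, -), Si,j-1 + w(-, bj) }

24 Computing Si,j
(Figure: the score matrix with row i and column j. Cell Si,j is reached diagonally with w(ai, bj), from the cell above with w(ai, -), and from the cell to the left with w(-, bj). The final cell is Sm,n.)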

25 Initializations
Gap symbol: -3
S0,1=-3, S0,2=-6, S0,3=-9, S0,4=-12, S0,5=-15, S0,6=-18, S0,7=-21, S0,8=-24
S1,0=-3, S2,0=-6, S3,0=-9, S4,0=-12, S5,0=-15, S6,0=-18, S7,0=-21
(Figure: score matrix with columns C G G A T C A T and rows C T T A A C T; the first row and the first column hold the values above.)

26 S1,1 = ?
Match: 8 Mismatch: -5 Gap symbol: -3
Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8
Option 2: S1,1 = S0,1 + w(a1, -) = -3 - 3 = -6
Option 3: S1,1 = S1,0 + w(-, b1) = -3 - 3 = -6
Optimal: S1,1 = 8

27 S1,2 = ?
Match: 8 Mismatch: -5 Gap symbol: -3
Option 1: S1,2 = S0,1 + w(a1, b2) = -3 - 5 = -8
Option 2: S1,2 = S0,2 + w(a1, -) = -6 - 3 = -9
Option 3: S1,2 = S1,1 + w(-, b2) = 8 - 3 = 5
Optimal: S1,2 = 5

28 S2,1 = ?
Match: 8 Mismatch: -5 Gap symbol: -3
Option 1: S2,1 = S1,0 + w(a2, b1) = -3 - 5 = -8
Option 2: S2,1 = S1,1 + w(a2, -) = 8 - 3 = 5
Option 3: S2,1 = S2,0 + w(-, b1) = -6 - 3 = -9
Optimal: S2,1 = 5

29 S2,2 = ?
Match: 8 Mismatch: -5 Gap symbol: -3
Option 1: S2,2 = S1,1 + w(a2, b2) = 8 - 5 = 3
Option 2: S2,2 = S1,2 + w(a2, -) = 5 - 3 = 2
Option 3: S2,2 = S2,1 + w(-, b2) = 5 - 3 = 2
Optimal: S2,2 = 3

30 S3,5 = ?
        C   G   G   A   T   C   A   T
    0  -3  -6  -9 -12 -15 -18 -21 -24
C  -3   8   5   2  -1  -4  -7 -10 -13
T  -6   5   3   0  -3   7   4   1  -2
T  -9   2   0  -2  -5   ?

31 S3,5 = 5; the completed matrix:
        C   G   G   A   T   C   A   T
    0  -3  -6  -9 -12 -15 -18 -21 -24
C  -3   8   5   2  -1  -4  -7 -10 -13
T  -6   5   3   0  -3   7   4   1  -2
T  -9   2   0  -2  -5   5   2  -1   9
A -12  -1  -3  -5   6   3   0  10   7
A -15  -4  -6  -8   3   1  -2   8   5
C -18  -7  -9 -11   0  -2   9   6   3
T -21 -10 -12 -14  -3   8   6   4  14
Optimal score: S7,8 = 14

32 C T T A A C - T
   C G G A T C A T
8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
(Figure: the score matrix with the path of this optimal alignment marked.)
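The matrix fill and the traceback walked through on the preceding slides can be sketched as below. This dynamic-programming scheme is commonly known as the Needleman-Wunsch algorithm; the function name `global_align` and the tie-breaking order in the traceback (diagonal first, then a gap in b, then a gap in a) are assumptions made for this sketch, so other equally good alignments may exist that it does not report.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Fill the score matrix S and trace back one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: a aligned to gaps only
        S[i][0] = i * gap
    for j in range(1, n + 1):          # first row: b aligned to gaps only
        S[0][j] = j * gap

    def w(x, y):
        return match if x == y else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Trace back from S[m][n] to S[0][0].
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return S, "".join(reversed(top)), "".join(reversed(bottom))


S, row_a, row_b = global_align("CTTAACT", "CGGATCAT")
print(S[7][8])   # 14
print(row_a)     # CTTAAC-T
print(row_b)     # CGGATCAT
```

The matrix has (m+1) x (n+1) cells and each cell looks at three neighbours, so both time and memory grow with the product of the two sequence lengths.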

33 Local vs. Global Sequence Alignment:
Example:
DNA sequence a: ATTCTTGC
DNA sequence b: ATCCTATTCTAGC
Local alignment: ATTCTTGC placed as one block against ATCCTATTCTAGC, with one gap marked. Gaps outside the aligned section are ignored in local alignments.
Global alignment: sequence a split into AT, TCTT and GC against ATCCTATTCTAGC, with two gaps marked. Gaps are counted in global alignments.

34 Global Alignment vs. Local Alignment
Global alignment: all sections are counted.
Local alignment: only local sections (normally separated by gaps) are counted.

35 An optimal local alignment
Si,j: the score of an optimal local alignment ending at ai and bj.
With proper initializations, Si,j can be computed as follows:
Si,j = max { 0, Si-1,j-1 + w(ai, bj), Si-1,j + w(ai, -), Si,j-1 + w(-, bj) }

36 Initializations
Match: 8 Mismatch: -5 Gap symbol: -3
(Figure: score matrix with columns C G G A T C A T and rows C T T A A C T; the first row and the first column are initialized to 0.)

37 S1,1 = ?
Match: 8 Mismatch: -5 Gap symbol: -3
Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8
Option 2: S1,1 = S0,1 + w(a1, -) = 0 - 3 = -3
Option 3: S1,1 = S1,0 + w(-, b1) = 0 - 3 = -3
Option 4: S1,1 = 0
Optimal: S1,1 = 8

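38 local alignment
Match: 8 Mismatch: -5 Gap symbol: -3
(Figure: partially filled local-alignment score matrix with columns C G G A T C A T and rows C T T A A C T; entries visible in the transcript: 8 5 2 3 13 11, with the next cell marked ?.)

39 local alignment
A - C - T
A T C A T
8 - 3 + 8 - 3 + 8 = 18
The best score: 18
(Figure: completed local-alignment score matrix; entries visible in the transcript: 8 5 2 3 13 11 10 7 18.)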
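The local variant differs from the global one only in the zero-initialized borders, the extra option 0, and a traceback that starts at the best cell and stops at a zero. A minimal sketch follows; the approach is commonly known as the Smith-Waterman algorithm, and the function name `local_align` and the traceback tie-breaking order are assumptions of this sketch.

```python
def local_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return the best local score and one alignment that achieves it."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # borders stay at 0

    def w(x, y):
        return match if x == y else mismatch

    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(0,
                          S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("CTTAACT", "CGGATCAT"))  # (18, 'A-C-T', 'ATCAT')
```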

40 BLAST Basic Local Alignment Search Tool
Procedure:
Divide all sequences into overlapping constituent words (size k).
Build the hash table for Sequence a.
Scan Sequence b for hits.
Extend hits.
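The first three steps of the procedure can be sketched with a dictionary as the hash table. This is a simplified illustration that only finds exact word matches; the function names, the choice k = 3 and the example sequences (taken from the earlier DNA example in this lecture) are assumptions of the sketch, not details of the BLAST implementation.

```python
from collections import defaultdict


def build_word_index(seq, k):
    """Map every overlapping word of length k to its start positions in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_hits(a, b, k):
    """Scan b word by word and report (position in a, position in b) hits."""
    index = build_word_index(a, k)
    hits = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], []):
            hits.append((i, j))
    return hits


print(find_hits("ATTCTTGC", "ATCCTATTCTAGC", k=3))
# [(0, 5), (1, 6), (2, 7)]
```

All three hits lie on the same diagonal (position in b minus position in a equals 5), which is the kind of cluster that the extension step then tries to grow.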

41 BLAST Basic Local Alignment Search Tool
Step 1: Hash table for sequence A

42 Amino acid similarity matrix PAM 120
Instead of the simple values +8 and -5 for matches and mismatches, this statistically derived score matrix is used to rank the level of similarity between two amino acids.

43 Amino acid similarity matrix PAM 250
This is a more widely used score matrix for ranking the level of similarity between two amino acids. It is derived from more diverse data sets and a larger number of statistical steps.

44 Amino acid similarity matrix Blosum 45
The Blosum matrices were calculated using data from the BLOCKS database, which contains alignments of more distantly related proteins. In principle, Blosum matrices should be more realistic for comparing distantly related proteins, but may introduce error for conventional proteins.

45 BLAST Basic Local Alignment Search Tool

46 BLAST Basic Local Alignment Search Tool
Step 2: Use all of the 2-letter words in the query sequence to scan against the database sequence and mark those with score > 8.
Note: Marked points can be on the diagonal and off-diagonal.
Word-pair scores shown: LN:LN=9, NF:NY=8, GW:PW=10

47 BLAST Step2: Scan sequence b for hits.

48 BLAST
Step 2: Scan sequence b for hits.
Step 3: Extend hits. Terminate if the score of the extension fades away.
BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
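One way to read "terminate if the score of the extension fades away" is an ungapped extension that stops once the running score falls too far below the best score seen. The sketch below follows that reading; the function name `extend_hit`, the `drop_off` value of 10 and the reuse of the +8/-5 scores are assumptions for illustration, not parameters taken from BLAST.

```python
def extend_hit(a, b, i, j, k, match=8, mismatch=-5, drop_off=10):
    """Extend an exact word hit a[i:i+k] == b[j:j+k] in both directions, without gaps."""
    def w(x, y):
        return match if x == y else mismatch

    score = sum(w(a[i + t], b[j + t]) for t in range(k))
    best, left, right = score, 0, k

    # Extend to the right of the seed word.
    run, p = score, k
    while i + p < len(a) and j + p < len(b):
        run += w(a[i + p], b[j + p])
        p += 1
        if run > best:
            best, right = run, p
        elif best - run > drop_off:
            break                      # the score has faded away

    # Extend to the left, starting from the best score found so far.
    run, p = best, 1
    while i - p >= 0 and j - p >= 0:
        run += w(a[i - p], b[j - p])
        if run > best:
            best, left = run, p
        elif best - run > drop_off:
            break
        p += 1

    # Score, half-open span in a, half-open span in b.
    return best, (i - left, i + right), (j - left, j + right)


print(extend_hit("ATTCTTGC", "ATCCTATTCTAGC", i=0, j=5, k=3))
# (51, (0, 8), (5, 13))
```

In this example the extension survives one mismatch (T against A) because the score only drops by 5 before recovering, which is less than the assumed drop-off.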

49 Multiple sequence alignment (MSA)
The multiple sequence alignment problem is to simultaneously align more than two sequences.
Seq1: GCTC
Seq2: AC
Seq3: GATC
GC-TC
A---C
G-ATC

50 Multiple sequence alignment MSA

51 How to score an MSA? Sum-of-Pairs (SP-score)
For the alignment
GC-TC
A---C
G-ATC
SP-score = Score(GC-TC, A---C) + Score(GC-TC, G-ATC) + Score(A---C, G-ATC)

52 How to score an MSA? Sum-of-Pairs (SP-score)
Score(GC-TC, A---C) = 5
Score(GC-TC, G-ATC) = 18
Score(A---C, G-ATC) = 5
SP-score = 5 + 18 + 5 = 28
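The sum-of-pairs score can be sketched by scoring every pair of rows column by column. One point needs an assumption: the figures on the slide (5, 18, 5) are reproduced when a column that pairs a gap with a gap is scored like a match (+8), so the sketch exposes that value as a hypothetical `gap_gap` parameter; many treatments set it to 0 instead, which would change the result.

```python
from itertools import combinations


def column_score(x, y, match=8, mismatch=-5, gap=-3, gap_gap=8):
    """Score one column of a pairwise projection of the MSA."""
    if x == "-" and y == "-":
        return gap_gap        # assumption: chosen to reproduce the slide's numbers
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def pair_score(row1, row2, **scores):
    return sum(column_score(x, y, **scores) for x, y in zip(row1, row2))


def sp_score(rows, **scores):
    """Sum the pairwise scores over all pairs of rows."""
    return sum(pair_score(r1, r2, **scores) for r1, r2 in combinations(rows, 2))


msa = ["GC-TC", "A---C", "G-ATC"]
print(sp_score(msa))              # 28
print(sp_score(msa, gap_gap=0))   # 12
```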

53 Position Specific Iterated BLAST
PSI-BLAST is a rather permissive alignment tool, and it can find more distantly related sequences than FASTA or BLAST. In many cases, it is much more sensitive to weak but biologically relevant sequence similarities.

54 Position Specific Iterated BLAST
PSI-BLAST is used for:
Distant homology detection
Fold assignment: profile-profile comparison
Domain identification
Evolutionary analysis (e.g. tree building)
Sequence annotation / function assignment
Profile export to other programs
Sequence clustering
Structural genomics target selection

55 Position Specific Iterated BLAST
Collect all database sequence segments that have been aligned with the query sequence with an E-value below the set threshold (default 0.001, but all sequences with E < 10 are displayed for manual inclusion).
Construct a position specific scoring matrix for the collected sequences. Rough idea:
Align all sequences to the query sequence as the template.
Assign weights to the sequences.
Construct the position specific scoring matrix.
Iterate.
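The "construct position specific scoring matrix" step can be illustrated with a heavily simplified sketch. It is not PSI-BLAST's actual procedure: it skips sequence weighting, assumes a uniform background frequency for the 20 amino acids, and uses a single hypothetical `pseudocount` parameter. The example rows are the four aligned sequences shown in this lecture's PSI-BLAST slides.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(rows, pseudocount=1.0):
    """Return one dict of log-odds scores (base 2) per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    pssm = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != "-")   # gaps are ignored
        total = sum(counts.values())
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount * background) / (total + pseudocount)
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


rows = ["MGLLTREIF--ILQQ",
        "FGLGRT-I-T-YMTN",
        "-GLVRT-I---LGLE",
        "FGLLRT-I---YMTQ"]
pssm = build_pssm(rows)
print(round(pssm[1]["G"], 2))   # conserved G in column 2: positive score
print(round(pssm[1]["A"], 2))   # unobserved residue: negative score
```

A conserved position rewards its residue strongly and penalizes all others, while a variable position gives flatter scores; this is the sense in which a profile holds more information than a single sequence.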

56 How PSI-BLAST works
Take a sequence:
MGLLTREIF--ILQQ
Search for similar sequences in a full sequence database.
Sequences are multiply aligned:
MGLLTREIF--ILQQ
FGLGRT-I-T-YMTN
-GLVRT-I---LGLE
FGLLRT-I---YMTQ
Construct a profile, and represent conservation in each position numerically. (Figure: profile matrix with one row per amino acid, A, C, …, Y.)
The profile holds more information than a single sequence: use the profile to retrieve additional sequences.
New sequences in the multiple alignment:
FGLLRT-I-T-YMTN
-RLTRD-I---LGLY
FGLLRT-I---FMTS
Construct a new profile.
After several iterations of this procedure we have:
Sequence information, including links to annotation
Several sets of multiple alignments
Profiles, derived by us or by PSI-BLAST
Threshold information (alignment statistics)

57 Consensus sequence
A sequence where each position is defined by majority vote based on a multiple sequence alignment. Use the consensus sequence for database search.
Sequence shown on the slide: PEAINYGRFTPFS I KSDVW
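The majority vote can be sketched in one line per column. Two choices here are assumptions of the sketch rather than statements from the lecture: the gap symbol takes part in the vote like any residue, and ties go to the symbol seen first in the column (the documented ordering of `Counter.most_common`). The rows are the four aligned sequences from this lecture's PSI-BLAST slides.

```python
from collections import Counter


def consensus(rows):
    """Pick the most frequent symbol in every column of the alignment."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*rows))


rows = ["MGLLTREIF--ILQQ",
        "FGLGRT-I-T-YMTN",
        "-GLVRT-I---LGLE",
        "FGLLRT-I---YMTQ"]
print(consensus(rows))  # FGLLRT-I---YMTQ
```

Stripping the gap symbols from the result gives a plain sequence that could be used as a database query.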

58 Flow chart of PSI-BLAST
Take a sequence.
Search for similar sequences in a full sequence database.
Sequences are multiply aligned:
MGLLTREIF--ILQQ
FGLGRT-I-T-YMTN
-GLVRT-I---LGLE
FGLLRT-I---YMTQ
Construct a profile, and represent conservation in each position numerically. (Figure: profile matrix with one row per amino acid, A, C, …, Y.)
The profile holds more information than a single sequence: use the profile to retrieve additional sequences.
Use the profile to search for similar sequences in a full sequence database.
New sequences in the multiple alignments:
FGLLRT-I-T-YMTN
-RLTRD-I---LGLY
FGLLRT-I---FMTS
Construct a new profile.
New iteration, then the next new iteration, and so on.

59 PSI-BLAST NCBI PSI-BLAST tutorial :


84 Summary of Today’s lecture
Sequence alignment methods revisited:
Pair-wise alignment
Multiple sequence alignment
BLAST
PSI-BLAST
Use of PSI-BLAST to probe protein function

