Introduction to bioinformatics Lecture 8

Multiple sequence alignment (2)

Flavodoxin-cheY: Pre-processing (prepro1500)

Progressive multiple alignment general principles
[Figure: pairwise alignment scores (1-2, 1-3, ..., 4-5) form a 5×5 similarity matrix; scores are converted to distances, from which a guide tree is built; the multiple alignment follows the guide tree, with possibilities for iteration.]

General progressive multiple alignment technique (follow generated tree)
[Figure: guide tree over sequences 1, 3, 2, 5 and 4, followed step by step up to the root.]
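As a rough illustration of how a join order can be read off pairwise scores, the sketch below converts similarity scores to distances and clusters the sequences by average linkage. The score values, the conversion `1 - score / max_score`, and the function name `guide_tree_order` are assumptions made for this example; average linkage is used here only because it is short to write, and it is not the tree method of any particular program.

```python
from itertools import combinations

def guide_tree_order(scores, n):
    """Return the successive merges (nested tuples) for sequences 1..n.

    scores maps a pair (i, j) with i < j to a pairwise alignment score.
    Assumption: distance = 1 - score / max_score (illustrative only).
    """
    max_score = max(scores.values())
    dist = {pair: 1.0 - sc / max_score for pair, sc in scores.items()}

    # Each cluster is (subtree, member list); start with single sequences.
    clusters = [(i, [i]) for i in range(1, n + 1)]
    merges = []

    def avg_dist(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        pairs = [(min(x, y), max(x, y)) for x in a[1] for y in b[1]]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: avg_dist(*ab))
        clusters.remove(a)
        clusters.remove(b)
        merged = ((a[0], b[0]), a[1] + b[1])
        merges.append(merged[0])
        clusters.append(merged)
    return merges

# Hypothetical pairwise scores for five sequences.
scores = {
    (1, 2): 40, (1, 3): 90, (1, 4): 20, (1, 5): 45, (2, 3): 42,
    (2, 4): 22, (2, 5): 80, (3, 4): 18, (3, 5): 44, (4, 5): 21,
}
merges = guide_tree_order(scores, 5)
```

With these made-up scores, sequences 1 and 3 are joined first, then 2 and 5, then the two pairs, and sequence 4 is added last. A progressive aligner would perform one pairwise (sequence or profile) alignment per merge, in this order.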

Progressive multiple alignment
Problem: accuracy is very important, because errors are propagated through the progressive steps. “Once a gap, always a gap” (Feng & Doolittle, 1987).

How to represent a block of sequences
Historically: consensus sequence, a single sequence that best represents the amino acids observed at each alignment position.
Modern methods: alignment profile, a representation that retains the information about the frequencies of amino acids observed at each alignment position.
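The difference between the two representations can be shown on a small aligned block. This is a minimal sketch: the block of sequences and the function names `column_profile` and `consensus` are made up for illustration, and gap symbols, if present, would simply be counted like any other character.

```python
from collections import Counter

def column_profile(block):
    """Per-column amino acid frequencies for equal-length aligned sequences."""
    n = len(block)
    return [
        {aa: count / n for aa, count in Counter(col).items()}
        for col in zip(*block)
    ]

def consensus(block):
    """Most frequent symbol per column (ties go to the first one seen)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*block))

# Hypothetical aligned block of four sequences.
block = ["ACDVL", "ACEVL", "GCDIL", "ACDVM"]

cons = consensus(block)        # one symbol per position
prof = column_profile(block)   # one frequency table per position
```

The consensus keeps only the majority residue at each position, whereas the profile still records, for example, that the first column is 75% A and 25% G.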

Multiple alignment profiles Gribskov et al. 1987
[Figure: profile with one row per amino acid (C, D, ..., W, Y) holding per-position frequencies (e.g. 0.3, 0.1), plus position-dependent gap penalties (e.g. 1.0, 0.5).]

Profile-sequence alignment
[Figure: profile-sequence alignment, labelled ACD……VWY.]

Sequence to profile alignment
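Profile position with frequencies 0.4 A, 0.2 L, 0.4 V. Score of amino acid L in a sequence that is aligned against this profile position: Score = 0.4*s(L,A) + 0.2*s(L,L) + 0.4*s(L,V), where s(x,y) is the value in the amino acid exchange matrix for the pair (x,y).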
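The score of a residue against a profile position can be sketched as a frequency-weighted sum of exchange scores, in line with the profile-to-profile formula used in this lecture. The exchange values in `S` are illustrative numbers chosen for this example, and the helper names `s` and `residue_vs_column` are not from the slides.

```python
# Illustrative exchange scores for the pairs used here (assumed values).
S = {("L", "A"): -1, ("L", "L"): 4, ("L", "V"): 1}

def s(x, y):
    """Symmetric lookup in the toy exchange matrix."""
    return S[(x, y)] if (x, y) in S else S[(y, x)]

def residue_vs_column(residue, column):
    """Frequency-weighted sum of exchange scores for one residue
    against one profile column (dict: amino acid -> frequency)."""
    return sum(freq * s(residue, aa) for aa, freq in column.items())

column = {"A": 0.4, "L": 0.2, "V": 0.4}
score = residue_vs_column("L", column)
# 0.4*(-1) + 0.2*4 + 0.4*1 = 0.8 with the assumed values above
```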

Profile-profile alignment
[Figure: profile-profile alignment, with profile rows labelled by amino acid (C, D, ..., Y) and the label ACD……VWY.]

Profile to profile alignment
First profile position: 0.4 A, 0.2 L, 0.4 V. Second profile position: 0.75 G, 0.25 S.
Match score of these two alignment columns using the amino acid frequencies at the corresponding profile positions:
Score = 0.4*0.75*s(A,G) + 0.2*0.75*s(L,G) + 0.4*0.75*s(V,G) + 0.4*0.25*s(A,S) + 0.2*0.25*s(L,S) + 0.4*0.25*s(V,S)
s(x,y) is the value in the amino acid exchange matrix (e.g. PAM250, Blosum62) for the amino acid pair (x,y).
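The column-against-column score above is a double sum over the residues of both columns. A minimal sketch follows; the exchange values in `S` are illustrative numbers assumed for this example, and `column_vs_column` is a hypothetical helper name.

```python
# Illustrative exchange scores for the pairs used here (assumed values).
S = {
    ("A", "G"): 0, ("L", "G"): -4, ("V", "G"): -3,
    ("A", "S"): 1, ("L", "S"): -2, ("V", "S"): -2,
}

def s(x, y):
    """Symmetric lookup in the toy exchange matrix."""
    return S[(x, y)] if (x, y) in S else S[(y, x)]

def column_vs_column(col_a, col_b):
    """Sum over all residue pairs of freq_a * freq_b * s(a, b)."""
    return sum(
        fa * fb * s(a, b)
        for a, fa in col_a.items()
        for b, fb in col_b.items()
    )

col_1 = {"A": 0.4, "L": 0.2, "V": 0.4}
col_2 = {"G": 0.75, "S": 0.25}
score = column_vs_column(col_1, col_2)
# 0 - 0.6 - 0.9 + 0.1 - 0.1 - 0.2 = -1.7 with the assumed values above
```

Scoring a single residue against a profile column is the special case in which one of the two columns has a single residue with frequency 1.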

Clustal, ClustalW, ClustalX
CLUSTAL W/X (Thompson et al., 1994) uses the Neighbour Joining (NJ) algorithm (Saitou and Nei, 1987), widely used in phylogenetic analysis, to construct the guide tree (see Lecture 4). Sequence blocks are represented by profiles, in which the individual sequences are additionally weighted according to the branch lengths in the NJ tree. Further carefully crafted heuristics include: (i) local gap penalties; (ii) automatic selection of the amino acid substitution matrix; (iii) automatic gap penalty adjustment; (iv) a mechanism to delay alignment of sequences that appear to be distant at the time they are considered. CLUSTAL (W/X) does not allow iteration.

Strategies for multiple sequence alignment
Strategies: profile pre-processing; secondary structure-induced alignment; globalised local alignment; matrix extension. Objective: try to avoid (early) errors.

Profile pre-processing
[Figure: pairwise scores (1-2, 1-3, ..., 4-5) are computed; for key sequence 1, sequences 2, 3, 4 and 5 are aligned to it in a master-slave (N-to-1) pre-alignment, which is then converted into a pre-profile.]

Pre-profile generation
[Figure: pairwise scores (1-2, 1-3, ..., 4-5) are compared with a cut-off; the pre-alignments for key sequences 1, 2, ..., 5, each with the other sequences aligned to it, yield one pre-profile per sequence.]

Profile pre-processing
[Figure: pairwise scores (1-2, 1-3, ..., 4-5), the pre-alignments for key sequences 1, 2, ..., 5, and the resulting pre-profiles.]

Pre-profile alignment
[Figure: pre-profiles 1 to 5 are aligned to give the final alignment of sequences 1 to 5.]

Pre-profile alignment
[Figure: the pre-alignments behind pre-profiles 1 to 5, each headed by its key sequence, are combined into the final alignment.]

Pre-profile alignment Alignment consistency
[Figure: alignment consistency across pre-alignments 1 to 5, illustrated with residue Ala131 of sequence 1 and the residue labels A131, L133 and C126.]

PRALINE pre-profile generation
Idea: use the information from all query sequences to make a pre-profile for each query sequence that contains information from the other sequences. You can use all sequences in each pre-profile, or use only those sequences that will probably align ‘correctly’; incorrectly aligned sequences in the pre-profiles will increase the noise level. Select using the alignment score: only allow a sequence into a pre-profile if its alignment with the key sequence scores higher than a given threshold value. In PRALINE, this threshold is given as, for example, prepro=1500 (alignment score threshold value of 1500; see next two slides).
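The threshold-based selection can be sketched as a simple filter over pairwise scores. This is not PRALINE's code: the sequence names, the score values and the function `select_for_preprofiles` are hypothetical, and only the rule "score higher than the threshold" is taken from the text.

```python
def select_for_preprofiles(names, pair_scores, threshold):
    """For each key sequence, list the other sequences whose pairwise
    alignment score with it is higher than the threshold."""
    selected = {}
    for key in names:
        selected[key] = [
            other for other in names
            if other != key
            and pair_scores[frozenset((key, other))] > threshold
        ]
    return selected

# Hypothetical names and pairwise alignment scores.
names = ["seq1", "seq2", "seq3", "seq4"]
pair_scores = {
    frozenset(("seq1", "seq2")): 2100,
    frozenset(("seq1", "seq3")): 1600,
    frozenset(("seq1", "seq4")): 900,
    frozenset(("seq2", "seq3")): 1400,
    frozenset(("seq2", "seq4")): 800,
    frozenset(("seq3", "seq4")): 1550,
}

strict = select_for_preprofiles(names, pair_scores, threshold=1500)
loose = select_for_preprofiles(names, pair_scores, threshold=0)
```

With a threshold of 0 every other sequence enters each pre-profile; with 1500 the pre-profile of `seq1` is built from `seq2` and `seq3` only, which trades information for lower noise.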

Flavodoxin-cheY consistency scores (PRALINE prepro=0)
[Figure: PRALINE alignment (prepro=0, iteration 0) of the flavodoxin-cheY set (1fx, FLAV_DESVH, FLAV_DESDE, FLAV_DESGI, FLAV_DESSA, 4fxn, FLAV_MEGEL, 2fcr, FLAV_ANASP, FLAV_ECOLI, FLAV_AZOVI, FLAV_ENTAG, FLAV_CLOAB, 3chy), annotated per position with consistency scores, with average consistency and conservation rows. AvSId = 0.297.]
Consistency values are scored from 0 to 10; the value 10 is represented by the corresponding amino acid (red).

Flavodoxin-cheY consistency scores (PRALINE prepro=1500)
[Figure: PRALINE alignment (prepro=1500, iteration 0) of the flavodoxin-cheY set (1fx, FLAV_DESVH, FLAV_DESSA, FLAV_DESGI, FLAV_DESDE, 4fxn, FLAV_MEGEL, 2fcr, FLAV_ANASP, FLAV_AZOVI, FLAV_ENTAG, FLAV_ECOLI, FLAV_CLOAB, 3chy), annotated per position with consistency scores, with average consistency and conservation rows. AvSId = 0.308.]
Consistency values are scored from 0 to 10; the value 10 is represented by the corresponding amino acid (red).

Iteration
Alignment iteration: do an alignment, learn from it, do it better next time (bootstrapping).

Consistency iteration
[Figure: pre-profiles, the multiple alignment, and its positional consistency scores.]
The consistency weights in the multiple alignment are copied into a vector for each sequence (the red-black vectors above each pre-profile in the figure) and used as weights in the DP (dynamic programming) runs that align sequences and sequence blocks, to make a new (and hopefully better) multiple sequence alignment.

Pre-profile update iteration
[Figure: pre-profiles and the multiple alignment.]
The sequences as aligned in the multiple alignment are copied into the pre-profiles for each sequence. This changes the matching in the master-slave alignment (pre-alignment) and leads to different pre-profiles for the next iteration, which in turn will lead to a different (and hopefully better) MSA.

Iteration: three different scenarios
The three scenarios are convergence, limit cycle and divergence. A computer program should check whether the iteration reaches a convergence or limit-cycle state. To deal with divergence, a maximum number of iterations is often specified to limit computation time.
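One way to check for the three scenarios is to remember every alignment produced so far. The sketch below assumes a hypothetical `step` function that stands in for one full realignment pass and returns a hashable state (for instance a tuple of aligned strings); the function name `iterate_alignment` and the default cap of 20 iterations are likewise assumptions.

```python
def iterate_alignment(initial, step, max_iterations=20):
    """Repeat `step` until the result repeats or the cap is hit.

    Returns (final_state, status), where status is "convergence",
    "limit cycle" or "max iterations reached".
    """
    seen = {initial}   # every state produced so far
    current = initial
    for _ in range(max_iterations):
        nxt = step(current)
        if nxt == current:
            # Same alignment twice in a row: nothing changes any more.
            return nxt, "convergence"
        if nxt in seen:
            # An earlier alignment came back: the iteration is cycling.
            return nxt, "limit cycle"
        seen.add(nxt)
        current = nxt
    # Neither happened within the cap: treat as (possible) divergence.
    return current, "max iterations reached"

# Toy stand-ins for a realignment pass, using integers as states.
converged = iterate_alignment(0, lambda x: min(x + 1, 3))
cycled = iterate_alignment(0, lambda x: (x + 1) % 3)
capped = iterate_alignment(0, lambda x: x + 1, max_iterations=5)
```

Storing whole alignments is affordable for a handful of iterations; for long runs a hash of each alignment could be stored instead.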
