Protein Sequence Alignment Multiple Sequence Alignment

Part 3 Protein Sequence Alignment Multiple Sequence Alignment

Table 3.1. Web sites for alignment of sequence pairs
Name of site | Reference
Bayes block aligner | Zhu et al. (1998)
Likelihood-weighted sequence alignment | see Web site
PipMaker (percent identity plot), a graphical tool for assessing long alignments | Schwartz et al. (2000)
BCM Search Launcher |
SIM: local similarity program for finding alternative alignments | Huang et al. (1990); Huang and Miller (1991); Pearson and Miller (1992)
Global alignment programs (GAP, NAP) | Huang (1994)
FASTA program suite | Pearson and Miller (1992); Pearson (1996)
Pairwise BLAST | Altschul et al. (1990)
AceView: shows alignment of mRNAs and ESTs to the genome sequence |
BLAT: fast alignment for finding genes in genome | Kent (2002)
GeneSeqer: predicts genes and aligns mRNA and genome sequences | Usuka et al. (2000)
SIM4 | Florea et al. (1998)

Protein Sequence Alignment

Protein Pairwise Sequence Alignment
The alignment tools are similar to the DNA alignment tools: BLASTP, FASTA.
Main difference: instead of scoring match (+2) and mismatch (-1), we have similarity scores:
  Score s(i,j) > 0 if amino acids i and j have similar properties
  Score s(i,j) ≤ 0 otherwise
How should we score s(i,j)?

The 20 Amino Acids

Chemical Similarities Between Amino Acids
Acids & Amides: DENQ (Asp, Glu, Asn, Gln)
Basic: HKR (His, Lys, Arg)
Aromatic: FYW (Phe, Tyr, Trp)
Hydrophilic: ACGPST (Ala, Cys, Gly, Pro, Ser, Thr)
Hydrophobic: ILMV (Ile, Leu, Met, Val)

Amino Acid Substitutions Matrices
For aligning amino acids, we need a scoring matrix of 20 rows × 20 columns.
Matrices represent biological processes:
  Mutation causes changes in sequence
  Evolution tends to conserve protein function
  Similar function requires similar amino acids
Could base matrix on amino acid properties.
In practice: based on empirical data.

identity similarity

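Given an alignment of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.
[Figure: example alignment with one highlighted column; recoverable text: AGHKKKR D SFHRRRAGC, column residues D, E, -, S]
In this column, E and D are found 8 times out of 10.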

Amino Acid Matrices
Symmetric matrix of 20 × 20 entries: entry (i,j) = entry (j,i).
Entry (i,i) is greater than any entry (i,j), j ≠ i.
Entry (i,j): the score of aligning amino acid i against amino acid j.
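
The properties above can be sketched in a few lines of Python. This is only an illustration: the scores in TOY_SCORES are made-up values for three amino acids, not entries from a published matrix, and the helper names score and ungapped_score are hypothetical.

```python
# Toy scores for a few amino acid pairs (illustrative values only,
# not taken from PAM, BLOSUM or any other published matrix).
TOY_SCORES = {
    ("D", "D"): 6, ("E", "E"): 5, ("L", "L"): 4,
    ("D", "E"): 2, ("D", "L"): -4, ("E", "L"): -3,
}

def score(i, j):
    """Look up s(i,j); sorting the pair makes entry (i,j) equal entry (j,i)."""
    return TOY_SCORES[tuple(sorted((i, j)))]

def ungapped_score(a, b):
    """Sum s(i,j) over the aligned positions of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(score(x, y) for x, y in zip(a, b))

print(score("D", "E"), score("E", "D"))   # same entry both ways
print(ungapped_score("DEL", "EEL"))       # 2 + 5 + 4 = 11
```

Storing only one triangle of the matrix and sorting the key is one simple way to guarantee symmetry; a full 20 × 20 table would work equally well.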

PAM - Point Accepted Mutations
Developed by Margaret Dayhoff, 1978.
Analyzed very similar protein sequences:
  Proteins are evolutionarily close.
  Alignment is easy.
  Point mutations: mainly substitutions.
  Accepted mutations: accepted by natural selection.
Used global alignment.
Counted the number of substitutions (i,j) per amino acid pair: many i<->j substitutions => high score s(i,j).
Found that common substitutions involved chemically similar amino acids.

PAM 250
Similar amino acids are close to each other.
Regions define conserved substitutions.

Selecting a PAM Matrix
Low PAM numbers: short sequences, strong local similarities.
High PAM numbers: long sequences, weak similarities.
  PAM120 recommended for general use (40% identity)
  PAM60 for close relations (60% identity)
  PAM250 for distant relations (20% identity)
If uncertain, try several different matrices: PAM40, PAM120, PAM250 recommended.

BLOSUM Blocks Substitution Matrix
Steven and Jorja G. Henikoff (1992)
Based on the BLOCKS database:
  Families of proteins with identical function
  Highly conserved protein domains
Ungapped local alignment to identify motifs
Each motif is a block of local alignment
Counts amino acids observed in same column
Symmetrical model of substitution
Example block:
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
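
The column-counting step can be sketched as follows. This is a minimal illustration, assuming the block is given as equal-length ungapped strings; the function name count_column_pairs is hypothetical, and the later conversion of counts into matrix scores is not shown.

```python
from collections import Counter
from itertools import combinations

def count_column_pairs(block):
    """Count every pair of residues seen in the same column of a block.

    Pairs are stored in sorted order, so (A, B) and (B, A) share one
    counter, which matches the symmetrical model of substitution.
    """
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    return pair_counts

# Tiny made-up block: three sequences, two columns.
counts = count_column_pairs(["AB", "AB", "BB"])
print(counts[("A", "A")], counts[("A", "B")], counts[("B", "B")])  # 1 2 3
```

With n sequences in a block, each column contributes n(n-1)/2 pairs, so the three-sequence toy block above yields 3 pairs per column.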

BLOSUM Matrices
Different BLOSUMn matrices are calculated independently from BLOCKS.
BLOSUMn is based on sequences that are at most n percent identical.

Selecting a BLOSUM Matrix
For BLOSUMn, higher n is suitable for sequences which are more similar.
  BLOSUM62 recommended for general use
  BLOSUM80 for close relations
  BLOSUM45 for distant relations

Multiple Sequence Alignment

Multiple Alignment
Like pairwise alignment:
  n input sequences instead of 2
  Add indels to make same length
  Local and global alignments
Score columns in alignment independently
Seek an alignment to maximize score

Alignment Example
Without gaps:
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC
1*1 + 2*0.75 + 11*0.5, Score = 8
With gaps:
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
4*1 + 11*0.75 + 2*0.5, Score = 13.25
Column score: 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0
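
The column scoring in this example can be reproduced with a short Python sketch. Two details are assumptions inferred from the totals on the slide rather than stated on it: gap characters are not counted as matching each other, and for the unaligned sequences only the columns present in all four sequences are scored. The function names are hypothetical.

```python
from collections import Counter

def column_score(column):
    """Fraction of rows sharing the most common residue; a lone residue scores 0."""
    counts = Counter(c for c in column if c != "-")  # gaps never count as a match
    top = max(counts.values(), default=0)
    return top / len(column) if top >= 2 else 0.0

def alignment_score(rows):
    """Score columns independently and add them up.

    zip() stops at the shortest row, so trailing positions that are
    missing from some sequence are simply not scored.
    """
    return sum(column_score(col) for col in zip(*rows))

unaligned = ["GTCGTAGTCGGCTCGAC", "GTCTAGCGAGCGTGAT",
             "GCGAAGAGGCGAGC", "GCCGTCGCGTCGTAAC"]
aligned = ["GTCGTAGTCG-GC-TCGAC", "GTC-TAG-CGAGCGT-GAT",
           "GC-GAAG-AG-GCG-AG-C", "GCCGTCG-CG-TCGTA-AC"]

print(alignment_score(unaligned))  # 8.0
print(alignment_score(aligned))    # 13.25
```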

Dynamic Programming
Pairwise A–B alignment table
  Cell (i,j) = score of best alignment between first i elements of A and first j elements of B
  Complexity: length of A × length of B
3-way A–B–C alignment table
  Cell (i,j,k) = score of best alignment between first i elements of A, first j of B, first k of C
  Complexity: length A × length B × length C
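
A minimal Python sketch of the pairwise table, filled in the usual global-alignment manner. The match (+2) and mismatch (-1) values are the DNA scores quoted earlier in these slides; the linear gap penalty of -2 and the function name are assumptions made for illustration.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Fill the (len(a)+1) x (len(b)+1) table and return the bottom-right cell.

    table[i][j] holds the score of the best alignment between the first
    i elements of a and the first j elements of b.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: a prefix against gaps only
        table[i][0] = i * gap
    for j in range(1, cols):          # first row: b prefix against gaps only
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diagonal,
                              table[i - 1][j] + gap,   # gap in b
                              table[i][j - 1] + gap)   # gap in a
    return table[-1][-1]

print(global_alignment_score("GTC", "GTC"))  # 6: three matches
print(global_alignment_score("GTC", "GC"))   # 2: two matches and one gap
```

The two nested loops visit each cell once, which is where the length of A × length of B cost comes from; a 3-way table would add a third loop and a third index.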

MSA Complexity
n-way S1–S2–…–Sn-1–Sn alignment table
  Cell (x1,…,xn) = best alignment score between first x1 elements of S1, …, first xn elements of Sn
  Complexity: length S1 × … × length Sn
Example: protein family alignment
  100 proteins, 1000 amino acids each
  Complexity: 1000^100 = 10^300 table cells
  Calculation time: beyond the big bang!
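
The size of the n-way table for this example can be checked with a few lines of Python. The function name table_cells is hypothetical, and the count follows the slide's simplification of one cell per residue along each dimension.

```python
import math

def table_cells(lengths):
    """Number of cells in an n-way table: the product of the sequence lengths."""
    return math.prod(lengths)

cells = table_cells([1000] * 100)   # 100 proteins, 1000 amino acids each
print(cells == 10 ** 300)           # True
print(table_cells([1000, 1000]))    # 1000000 cells for a single pairwise table
```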

Feasible Approach
Based on pairwise alignment scores:
  Build n by n table of pairwise scores
  Align similar sequences first
  After alignment, consider as single sequence
  Continue aligning with further sequences
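
The first two steps can be sketched in Python as below. The scoring is a simple stand-in (match +2, mismatch -1, gap -2) rather than a protein substitution matrix, the four short sequences are illustrative, and all function names are hypothetical.

```python
from itertools import combinations

def pair_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming, keeping one row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def pairwise_table(seqs):
    """Build the symmetric n by n table of pairwise scores (diagonal left as None)."""
    n = len(seqs)
    table = [[None] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        table[i][j] = table[j][i] = pair_score(seqs[i], seqs[j])
    return table

def most_similar_pair(table):
    """Indices of the highest-scoring pair, i.e. the pair to align first."""
    return max(combinations(range(len(table)), 2),
               key=lambda p: table[p[0]][p[1]])

seqs = ["NYLS", "NKYLS", "NFS", "NFLS"]
table = pairwise_table(seqs)
print(most_similar_pair(table))  # (0, 1): NYLS and NKYLS are aligned first
```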

Sum of pairwise alignment scores
For n sequences, there are n(n-1)/2 pairs.
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC

1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT
3 GC-GAAGAGGCG-AGC
4 GCCGTCGCGTCGTAAC
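
A sum-of-pairs score for a multiple alignment can be sketched in Python as follows. The per-position values (match +2, mismatch -1, residue against gap -2, gap against gap 0) are assumptions chosen for illustration, and the function names are hypothetical.

```python
from itertools import combinations

def pair_column_score(x, y, match=2, mismatch=-1, gap=-2):
    """Score one aligned position of one pair of sequences."""
    if x == "-" and y == "-":
        return 0                      # two gaps carry no information
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def sum_of_pairs(rows):
    """Add up the pairwise scores over all n(n-1)/2 pairs of aligned rows."""
    total = 0
    for a, b in combinations(rows, 2):
        total += sum(pair_column_score(x, y) for x, y in zip(a, b))
    return total

print(sum_of_pairs(["AC", "AC", "A-"]))  # 4 + 0 + 0 = 4

msa = ["GTCGTAGTCG-GC-TCGAC", "GTC-TAG-CGAGCGT-GAT",
       "GC-GAAG-AG-GCG-AG-C", "GCCGTCG-CG-TCGTA-AC"]
print(len(list(combinations(msa, 2))))   # 6 pairs for n = 4
print(sum_of_pairs(msa))
```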

ClustalW Algorithm
Progressive Sequence Alignment (Higgins and Sharp 1988)
Compute pairwise alignments for all pairs of sequences.
Use the alignment scores to build a phylogenetic (guide) tree such that:
  similar sequences are neighbors in the tree
  distant sequences are distant from each other in the tree
The sequences are progressively aligned according to the branching order in the guide tree.
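
The guide-tree step can be illustrated with a small Python sketch. It uses greedy average-linkage clustering as a stand-in and is not necessarily the tree-building method ClustalW itself uses; the sequence names, the distance values and the function names are all hypothetical.

```python
from itertools import combinations

def average_distance(members_a, members_b, dist):
    """Mean distance between the members of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))

def guide_tree(names, dist):
    """Repeatedly join the two closest clusters; return the tree as nested tuples."""
    clusters = [(name, [name]) for name in names]   # (subtree, member names)
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_distance(clusters[p[0]][1], clusters[p[1]][1], dist),
        )
        tree_i, members_i = clusters[i]
        tree_j, members_j = clusters[j]
        merged = ((tree_i, tree_j), members_i + members_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Made-up distances: S1/S2 are close, S3/S4 are close, the two groups are far apart.
names = ["S1", "S2", "S3", "S4"]
dist = {frozenset(pair): 0.6 for pair in combinations(names, 2)}
dist[frozenset(("S1", "S2"))] = 0.1
dist[frozenset(("S3", "S4"))] = 0.2

print(guide_tree(names, dist))  # (('S1', 'S2'), ('S3', 'S4'))
```

Reading the nested tuples from the inside out gives the branching order: S1 with S2 first, then S3 with S4, then the two resulting alignments with each other.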

Progressive Sequence Alignment
(Protein sequences example)
Input sequences: NYLS, NKYLS, NFS, NFLS
NYLS + NKYLS: N K/- Y L S
NFS + NFLS: N F L/- S
Both alignments combined: N K/- Y/F L/- S

Treating Gaps in ClustalW
Penalty for opening a gap and an additional penalty for extending the gap.
Gaps found in the initial alignment remain fixed.
New gaps are introduced as more sequences are added (decreased penalty if a gap already exists).
The gap penalty is decreased within stretches of hydrophilic residues.
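
The opening-plus-extension scheme can be sketched in Python as below. The penalty values (10.0 to open, 0.5 per gap position) and the convention of charging the extension for every position of the gap are assumptions for illustration, not ClustalW's actual settings; the position-specific adjustments described above are not modelled, and the function names are hypothetical.

```python
import re

def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap: a fixed opening cost plus a cost per gap position."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length

def total_gap_penalty(aligned_row, gap_open=10.0, gap_extend=0.5):
    """Sum the affine penalty over every run of '-' in one aligned sequence."""
    return sum(
        affine_gap_penalty(len(run), gap_open, gap_extend)
        for run in re.findall(r"-+", aligned_row)
    )

print(affine_gap_penalty(3))          # 11.5
print(total_gap_penalty("AC--GT-A"))  # (10 + 1.0) + (10 + 0.5) = 21.5
```

Because opening costs far more than extending, one long gap is cheaper than several short ones of the same total length, which is the behavior the two-part penalty is meant to produce.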

MSA Approaches
Progressive approach: CLUSTALW (CLUSTALX), PILEUP, T-COFFEE
Iterative approach: repeatedly realign subsets of sequences. MultAlin, DiAlign.
Statistical methods: Hidden Markov Models (SAM2K)
Genetic algorithm: SAGA
