Biology 224. Instructor: Tom Peavy. October 18 & 20, 2010. Multiple Sequence Alignment. <Images adapted from Bioinformatics and Functional Genomics by Jonathan Pevsner>

Multiple sequence alignment: definition. A collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned. Homologous residues are aligned in columns across the length of the sequences. Residues are homologous in an evolutionary sense. Residues are homologous in a structural sense.

Multiple sequence alignment: properties. There is not necessarily one “correct” alignment of a protein family. Protein sequences evolve, and the corresponding three-dimensional structures of proteins also evolve. It may be impossible to identify amino acid residues that align properly (structurally) throughout a multiple sequence alignment. For two proteins sharing 30% amino acid identity, about 50% of the individual amino acids are superposable in the two structures.

Multiple sequence alignment: features. Some aligned residues, such as cysteines that form disulfide bridges, may be highly conserved. There may be conserved motifs such as a transmembrane domain. There may be conserved secondary structure features. There may be regions with consistent patterns of insertions or deletions (indels).

Multiple sequence alignment: methods. There are two main ways to make a multiple sequence alignment: (1) progressive alignment (Feng & Doolittle), e.g. ClustalW; (2) iterative approaches.

Use Clustal W to do a progressive MSA: http://www2.ebi.ac.uk/clustalw/

Feng-Doolittle MSA occurs in 3 stages:
[1] Do a set of global pairwise alignments (Needleman and Wunsch)
[2] Create a guide tree
[3] Progressively align the sequences

Progressive MSA stage 1 of 3: generate global pairwise alignments (five closely related lipocalins)
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96
Best score: 99, for sequences 2 and 3.
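Each score in this stage comes from a global (Needleman-Wunsch) alignment of one pair. The sketch below shows the dynamic-programming recurrence under a deliberately simple scheme: the match, mismatch, and gap values (+1, -1, -1) are illustrative assumptions, not the substitution matrix and affine gap penalties ClustalW actually uses, so it will not reproduce the scores listed above.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell takes the best of: align two residues, or gap in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diagonal, up, left)

    return score[-1][-1]


print(needleman_wunsch_score("GTWYA", "GLWYA"))
```

The function returns only the score; recovering the alignment itself would require a traceback through the same matrix.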

Number of pairwise alignments needed: for N sequences, (N-1)(N)/2. For 5 sequences, (4)(5)/2 = 10.
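As a quick check of the formula, the pairs can be enumerated directly. This is a small sketch; the function name `pairwise_comparisons` is made up for illustration.

```python
from itertools import combinations


def pairwise_comparisons(n):
    """Number of distinct pairs among n sequences: (n - 1)(n) / 2."""
    return (n - 1) * n // 2


# Enumerate the pairs for five sequences, numbered 1 to 5 as in the ClustalW log.
pairs = list(combinations(range(1, 6), 2))

print(pairwise_comparisons(5), len(pairs))
print(pairs)
```

The enumeration order, (1, 2) through (4, 5), is the same order in which ClustalW reports its pairwise alignments.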

Feng-Doolittle stage 2: guide tree. Convert similarity scores to distance scores. A tree shows the distance between objects. Distance methods are used (e.g. neighbor joining). ClustalW provides a syntax to describe the tree (the parenthesized Newick format). A guide tree is not a phylogenetic tree.
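The conversion from similarity to distance can be sketched with the ClustalW pairwise scores for the five lipocalins. The formula used here, distance = 1 - score/100, is an assumption chosen for illustration; the slides do not state the exact conversion.

```python
# Pairwise scores from the ClustalW log for the five lipocalins.
scores = {
    (1, 2): 84, (1, 3): 84, (1, 4): 91, (1, 5): 92,
    (2, 3): 99, (2, 4): 86, (2, 5): 85,
    (3, 4): 85, (3, 5): 84,
    (4, 5): 96,
}


def to_distances(similarity):
    """Convert percent similarity scores to distances (assumed: 1 - score/100)."""
    return {pair: round(1 - s / 100, 2) for pair, s in similarity.items()}


distances = to_distances(scores)

# The pair with the smallest distance is the first to be joined in the guide tree.
closest = min(distances, key=distances.get)
print(closest, distances[closest])
```

A tree-building method such as neighbor joining would then work from the full distance table rather than from the single closest pair.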

Progressive MSA stage 2 of 3: generate guide tree (five closely related lipocalins)
Tree leaves: 3 (rat RBP), 2 (murine RBP), 4 (porcine RBP), 5 (bovine RBP), 1 (human RBP)
((Human RBP:0.04284,(Mouse RBP:0.00075,Rat RBP:0.00423):0.10542):0.01900,Pig RBP:0.01924,Bovine RBP:0.01902);
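The parenthesized tree description can be read programmatically. The sketch below pulls out only the leaf names and their branch lengths with a regular expression; it is a hand-rolled illustration, not a full Newick parser, and it ignores the internal branch lengths (those that follow a closing parenthesis).

```python
import re

guide_tree = (
    "((Human RBP:0.04284,(Mouse RBP:0.00075,Rat RBP:0.00423):0.10542)"
    ":0.01900,Pig RBP:0.01924,Bovine RBP:0.01902);"
)

# A leaf is a name (letters and spaces) followed by ':' and a number.
# Internal branch lengths are preceded by ')' and therefore do not match.
leaf_pattern = re.compile(r"([A-Za-z][A-Za-z ]*):([0-9.]+)")

leaf_lengths = {name: float(length) for name, length in leaf_pattern.findall(guide_tree)}

for name, length in sorted(leaf_lengths.items(), key=lambda item: item[1]):
    print(f"{name:12s} {length:.5f}")
```

Mouse and rat RBP have the shortest branches and sit together in the innermost parentheses, which matches their pairwise score of 99.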

Feng-Doolittle stage 3: progressive alignment. Make an MSA based on the order in the guide tree. Start with the two most closely related sequences. Then add the next closest sequence. Continue until all sequences are added to the MSA. Rule: “once a gap, always a gap”.
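The order in which sequences join the alignment can be illustrated with the pairwise scores from stage 1. Note the substitution: instead of walking a real guide tree, this sketch uses a greedy rule (add whichever remaining sequence has the highest average score against those already aligned). That rule is an assumption standing in for the tree, and ClustalW itself may merge subgroups rather than add one sequence at a time.

```python
# Pairwise scores from the ClustalW log for the five lipocalins.
SCORES = {
    (1, 2): 84, (1, 3): 84, (1, 4): 91, (1, 5): 92,
    (2, 3): 99, (2, 4): 86, (2, 5): 85,
    (3, 4): 85, (3, 5): 84,
    (4, 5): 96,
}


def similarity(i, j):
    """Look up the score for a pair regardless of argument order."""
    return SCORES[(min(i, j), max(i, j))]


def greedy_order(n):
    """Start with the closest pair, then repeatedly add the next closest sequence."""
    aligned = list(max(SCORES, key=SCORES.get))
    remaining = [s for s in range(1, n + 1) if s not in aligned]
    while remaining:
        # Greedy stand-in for the guide tree: best average score to the aligned set.
        best = max(
            remaining,
            key=lambda s: sum(similarity(s, a) for a in aligned) / len(aligned),
        )
        aligned.append(best)
        remaining.remove(best)
    return aligned


print(greedy_order(5))
```

With these scores the order starts from sequences 2 and 3 (murine and rat RBP), the closest pair.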

Clustal W alignment of 5 closely related lipocalins
CLUSTAL W (1.82) multiple sequence alignment

gi|89271|pir||A39486           MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|    MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT   MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                 ********************:* ***:*****

gi|89271|pir||A39486           EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|    EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT   EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                               *********:*******.*:************.**:**************

gi|89271|pir||A39486           PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|    PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT   PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                               ****************:*******:****:*:* ****** *********
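The line of symbols under each block summarizes column conservation: `*` marks a column where all residues are identical, and `:` marks a column whose residues all fall within one strongly conserved group. The sketch below recomputes those two symbols for the first ten columns of the second block. The list of strong groups is the one usually given for ClustalW and is treated here as an assumption; the weaker `.` class is left out.

```python
# Strong conservation groups as commonly listed for ClustalW (assumption).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]


def conservation_line(rows):
    """Return '*' for identical columns, ':' for strong-group columns, else ' '."""
    marks = []
    for column in zip(*rows):
        residues = set(column)
        if "-" in residues:
            marks.append(" ")  # columns containing a gap get no symbol
        elif len(residues) == 1:
            marks.append("*")
        elif any(residues <= set(group) for group in STRONG_GROUPS):
            marks.append(":")
        else:
            marks.append(" ")  # weak ('.') groups are not modeled in this sketch
    return "".join(marks)


# First ten columns of the second alignment block, in the order shown.
block = [
    "EGLFLQDNIV",  # gi|89271|pir||A39486
    "EGLFLQDNIV",  # RETB_BOVIN
    "EGLFLQDNIV",  # NP_006735.1
    "EGLFLQDNII",  # RETB_MOUS
    "EGLFLQDNII",  # RETB_RAT
]

print(conservation_line(block))
```

The tenth column contains V and I, both members of the MILV group, so it is marked `:` as in the ClustalW output.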

Why “once a gap, always a gap”? There are many possible ways to make an MSA. Where gaps are added is a critical question. Gaps are often added to the first two (closest) sequences. To change the initial gap choices later on would be to give more weight to distantly related sequences. To maintain the initial gap choices is to trust that those gaps are most believable.
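One way to picture the rule: when a later sequence needs extra room, a whole column of gaps is inserted into the existing alignment, and the rows already aligned are never re-aligned against each other. The sketch below uses a twelve-column excerpt of the human and mouse RBP rows from the ClustalW output; the insertion position (column 3) is hypothetical and chosen only for illustration.

```python
def insert_gap_column(alignment, col):
    """Insert a gap at position col in every row of an existing alignment."""
    return [row[:col] + "-" + row[col:] for row in alignment]


def strip_all_gap_columns(alignment):
    """Drop columns that consist only of gaps, leaving the other columns intact."""
    keep = [
        c for c in range(len(alignment[0]))
        if any(row[c] != "-" for row in alignment)
    ]
    return ["".join(row[c] for c in keep) for row in alignment]


# Excerpt of the human (NP_006735.1) and mouse (RETB_MOUS) rows, columns 9 to 20.
profile = [
    "LLAAW--AAAER",
    "LLAALGGGSAER",
]

# Hypothetical step: a third sequence forces a new gap column at position 3.
widened = insert_gap_column(profile, 3)
print(widened)

# The original gap placement (the "--" in the human row) is untouched.
print(strip_all_gap_columns(widened) == profile)
```

Removing the newly added all-gap column recovers the original two-row alignment exactly, which is what "always a gap" guarantees.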

Multiple sequence alignment to profile HMMs. Hidden Markov models (HMMs) consist of “states” that describe the probability of having a particular amino acid residue at each position (column) of a multiple sequence alignment. HMMs are probabilistic models. Just as a hammer is more refined than a blast, an HMM gives more sensitive alignments than traditional techniques such as progressive alignments.

An HMM is constructed from an MSA. Example: five lipocalins
GTWYA (hs RBP)
GLWYA (mus RBP)
GRWYE (apoD)
GTWYE (E. coli)
GEWFS (MUP4)

GTWYA
GLWYA
GRWYE
GTWYE
GEWFS
Prob.  1    2    3    4    5
p(G)   1.0  -    -    -    -
p(T)   -    0.4  -    -    -
p(L)   -    0.2  -    -    -
p(R)   -    0.2  -    -    -
p(E)   -    0.2  -    -    0.4
p(W)   -    -    1.0  -    -
p(Y)   -    -    -    0.8  -
p(F)   -    -    -    0.2  -
p(A)   -    -    -    -    0.4
p(S)   -    -    -    -    0.2
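The probability table can be derived by counting residues in each column of the five aligned sequences. A minimal sketch follows; the function name `column_probabilities` is made up for illustration, and no pseudocounts are applied (real profile HMM software typically adds them).

```python
from collections import Counter

sequences = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]


def column_probabilities(seqs):
    """For each alignment column, return {residue: observed frequency}."""
    n = len(seqs)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*seqs)
    ]


probabilities = column_probabilities(sequences)

for position, column in enumerate(probabilities, start=1):
    print(position, column)
```

Position 2, for example, contains T twice and L, R, and E once each, giving the 0.4, 0.2, 0.2, 0.2 entries in the table.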

GTWYA
GLWYA
GRWYE
GTWYE
GEWFS
States (by position): 1 G:1.0 | 2 T:0.4 L:0.2 R:0.2 E:0.2 | 3 W:1.0 | 4 Y:0.8 F:0.2 | 5 E:0.4 A:0.4 S:0.2
P(GEWYE) = (1.0)(0.2)(1.0)(0.8)(0.4) = 0.064
log odds score = ln(1.0) + ln(0.2) + ln(1.0) + ln(0.8) + ln(0.4) = -2.75
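The arithmetic on this slide can be reproduced in a few lines. In this sketch the score is simply the sum of natural logs of the per-position probabilities, exactly as written on the slide (no background frequencies are involved). Returning negative infinity for an unseen residue is a choice made here for illustration.

```python
import math

# Per-position residue probabilities from the five-lipocalin example.
profile = [
    {"G": 1.0},
    {"T": 0.4, "L": 0.2, "R": 0.2, "E": 0.2},
    {"W": 1.0},
    {"Y": 0.8, "F": 0.2},
    {"A": 0.4, "E": 0.4, "S": 0.2},
]


def sequence_probability(seq, profile):
    """Multiply the probability of each residue at its position."""
    p = 1.0
    for residue, column in zip(seq, profile):
        p *= column.get(residue, 0.0)
    return p


def log_score(seq, profile):
    """Sum of ln(probability) over positions; -inf if a residue was never observed."""
    total = 0.0
    for residue, column in zip(seq, profile):
        p = column.get(residue, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total


print(round(sequence_probability("GEWYE", profile), 3))
print(round(log_score("GEWYE", profile), 2))
```

Working in log space turns the product into a sum, which avoids numerical underflow when a profile spans hundreds of positions.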

Databases of multiple sequence alignments: BLOCKS (HMM), CDD (HMM), DOMO (Gapped MSA), INTERPRO, iProClass, MetaFAM, Pfam (profile HMM library), PRINTS, PRODOM (PSI-BLAST), PROSITE, SMART

Query = your favorite protein. Database = set of many PSSMs. CDD is related to PSI-BLAST, but distinct: CDD searches against profiles generated from pre-selected alignments. Purpose: to find conserved domains in the query sequence. You can access CDD via DART at NCBI. CDD uses RPS-BLAST (reverse position-specific BLAST).

Multiple sequence alignment algorithms
         Progressive              Iterative
Local    PIMA                     DIALIGN
Global   CLUSTAL, PileUp, other   SAGA

Multiple sequence alignment programs: AMAS, CINEMA, ClustalW, ClustalX, DIALIGN, HMMT, Match-Box, MultAlin, MSA, Musca, PileUp, SAGA, T-COFFEE

Clustal X

GCG PileUp

Boxshade Alignment (“Pretty Shading”). Boxshade server: http://www.ch.embnet.org/software/BOX_form.html

Assessment of alternative multiple sequence alignment algorithms
[1] As percent identity among proteins drops, performance (accuracy) also declines. This is especially severe for proteins with < 25% identity. Proteins <25% identity: 65% of residues align well. Proteins <40% identity: 80% of residues align well.
[2] “Orphan” sequences are highly divergent members of a family. Surprisingly, orphans do not disrupt alignments. Also surprisingly, global alignment algorithms outperform local.
