Multiple Sequence Alignment

Multiple Sequence Alignment
What is it?
Why do we use it?
How to use it
Tools: ClustalW
Exercise 1

What Is Multiple Sequence Alignment?
Many genes are represented in highly conserved forms in a wide range of organisms.
Patterns of change in these gene sequences can be analyzed by aligning the sequences simultaneously, which identifies conserved regions.
This is known as multiple sequence alignment (MSA).
A multiple alignment arranges a set of sequences so that positions believed to be homologous are written in a common column.
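To make the "common column" idea concrete, here is a minimal Python sketch. The three sequences and their names are invented for illustration; the functions simply read an existing alignment column by column and report the columns where every row carries the same residue.

```python
# Toy alignment with hypothetical sequences; '-' marks a gap.
alignment = {
    "seq_A": "MKT-LLVA",
    "seq_B": "MKTALLVG",
    "seq_C": "MRT-LLIA",
}

def columns(aln):
    """Return the alignment as a list of columns (one string per position)."""
    rows = list(aln.values())
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return ["".join(col) for col in zip(*rows)]

def conserved_positions(aln):
    """0-based indices of gap-free columns where all rows share one residue."""
    return [i for i, col in enumerate(columns(aln))
            if "-" not in col and len(set(col)) == 1]

print(conserved_positions(alignment))  # [0, 2, 4, 5]
```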

Applications of Multiple Sequence Alignment
Predict protein function.
Predict protein structure (using structure superposition programs).
Predict the evolutionary history of sequences (using phylogenetic analysis programs).
Contig assembly (shotgun sequences and ESTs).
Identify new family members.
Design PCR primers for amplification of related sequences.
Search databases with the consensus sequence to identify other sequences with a similar pattern.

Multiple Sequence Alignment Guidelines
Select the sequences carefully. Make sure they are members of the same family and that they all share a common ancestor.
Use protein sequences if possible. Translate if necessary, then convert back to DNA after the alignment.
Protein sequences are three times shorter than the DNA that encodes them and use a more informative alphabet.
If there is little signal at the amino acid (aa) level, there will be no signal at the nucleotide (nt) level.
If you are interested in non-coding sequences you have no choice, but beware: DNA alignment is tricky and needs a very high level of conservation.
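The "convert back to DNA after the alignment" step can be sketched as follows. This is an illustrative helper, not part of any alignment package: the function name is hypothetical, and it assumes the protein has already been aligned elsewhere and that the DNA is exactly the coding sequence of that protein (three nucleotides per residue, no stop codon).

```python
def thread_dna_onto_protein(aligned_protein, dna):
    """Copy the gaps of an aligned protein onto its coding DNA, codon by codon.

    Each residue consumes the next codon; each protein gap becomes '---'.
    """
    residues = aligned_protein.replace("-", "")
    if len(dna) != 3 * len(residues):
        raise ValueError("DNA length must be 3 x the number of residues")
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join("---" if aa == "-" else next(codons)
                   for aa in aligned_protein)

# Aligned protein 'M-K' with coding DNA ATG AAA
print(thread_dna_onto_protein("M-K", "ATGAAA"))  # ATG---AAA
```

Because the gaps are inserted in whole codons, the reading frame of the DNA alignment is preserved.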

Multiple Sequence Alignment Guidelines Cont.
Ensure that at least half of the sequences share more than 30% identity, and avoid sequences that have more than 90% identity to another sequence.
An alignment that contains only very similar sequences is not very informative.
If you make sure that each sequence is between 30% and 70% identical to half of the sequences in the set, you will have made a reasonable compromise between new information and alignment quality.
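A rough way to check the identity guideline is sketched below. The function names and the toy sequences are hypothetical, and the identity convention (skip any column in which either row has a gap) is only one of several in use, so the numbers may differ slightly from what a given server reports.

```python
from itertools import combinations

def percent_identity(a, b):
    """Percent identity between two rows of the same alignment.

    Columns where either row has a gap are skipped (an assumed convention).
    """
    if len(a) != len(b):
        raise ValueError("rows must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def redundant_pairs(aln, high=90.0):
    """Name pairs whose identity exceeds `high` (candidates for removal)."""
    return [(n1, n2) for n1, n2 in combinations(aln, 2)
            if percent_identity(aln[n1], aln[n2]) > high]

toy = {"a": "ACDEFGHIKL", "b": "ACDEFGHIKL", "c": "ACDEWWWWWW"}
print(redundant_pairs(toy))  # [('a', 'b')]
```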

Start with a small set of sequences and avoid aligning more than 50 (if you do align more, employ a high level of manual curation).
Multiple alignment programs are not good at handling large sets of sequences.
Visualizing many aligned sequences is difficult; if the alignment runs over more than one page, interpretation can become difficult if not impossible.
Aligning a lot of sequences is computationally expensive and public servers have limited resources, so a run may take a long time. This makes it hard to fine-tune alignment parameters or to try alternative sequences.

Tree-building and structure-prediction programs do not handle big alignments well.
Accurate big alignments are difficult to make and less reliable, so it is hard to be confident that every sequence you have included really belongs to the family.
It is best to start small and gradually increase the size of the multiple alignment.
Before adding a sequence to a multiple alignment, you can check whether it is a good choice by doing a pairwise comparison.
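One simple form of pairwise comparison is a global (Needleman-Wunsch) alignment of the candidate against a sequence already in the set. The sketch below uses a naive scoring scheme (match +1, mismatch -1, gap -1) chosen only for illustration; real tools use substitution matrices such as BLOSUM and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

print(needleman_wunsch("ACGT", "AGT"))  # ('ACGT', 'A-GT', 2)
```

A candidate whose best pairwise alignment is mostly mismatches and gaps is unlikely to improve the multiple alignment.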

Multiple Sequence Alignment Guidelines Cont.
Use sequences of similar length. Programs have problems aligning partial sequences with complete ones.
Repeated domains are problematic for the alignment programs, especially if the number of domains differs between sequences.
Name sequences appropriately:
Never use white space: write clone2 or clone_2, not clone 2.
Do not use special symbols; stick to plain letters, numbers and the underscore.
Do not use names longer than 15 characters.
Use a unique name for each sequence.
Use informative names (OSJLBa0001A01f compared to Main_Clone1).
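The naming rules can be applied mechanically before submitting sequences. The helpers below are a hypothetical sketch: the function names and the choice of a numeric suffix to resolve duplicates are assumptions, not something any alignment program prescribes.

```python
import re

def clean_name(name, max_len=15):
    """Replace white space with '_', drop special symbols, cap the length."""
    name = re.sub(r"\s+", "_", name.strip())
    name = re.sub(r"[^A-Za-z0-9_]", "", name)
    return name[:max_len]

def clean_names(names, max_len=15):
    """Clean every name; make duplicates unique with a numeric suffix."""
    used = set()
    result = []
    for raw in names:
        base = clean_name(raw, max_len)
        candidate, k = base, 1
        while candidate in used:
            k += 1
            suffix = f"_{k}"
            candidate = base[:max_len - len(suffix)] + suffix
        used.add(candidate)
        result.append(candidate)
    return result

print(clean_names(["clone 2", "clone_2", "Prv-alpha (human)!"]))
# ['clone_2', 'clone_2_2', 'Prvalpha_human']
```

Automatic cleaning only covers the mechanical rules; choosing informative names is still up to you.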

EXPASY INTEGRATED BLAST & MSA SERVER

EXPASY INTEGRATED BLAST & MSA SERVER (databases and options)

Output of search displayed
Links to Pfam.
Scroll down.

View Alignments (helps inform selection)

Make selections for inclusion in msa
Select your sequences in FASTA format.
Send your selections (options).

Example selected sequences
Note the range of scores and E values selected.
ACC #   SwissProt #   Description   Organism   Score   EXP
P   PRVA_HUMAN   Parvalbumin alpha [PVALB] [Homo sapiens   e-47
P   PRVA_FELCA   Parvalbumin alpha [PVALB] [Felis silvestris   e-39
P   PRVA_AMPME   Parvalbumin alpha [Amphiuma (Salamand   e-23
P   PRVB_ESOLU   Parvalbumin beta [Esox lucius (Northern pike)]   e-19
P   PRVU_CHICK   Parvalbumin, thymic CPV3 (Parvalbumin 3) [G   e-18
P   ONCO_HUMAN   Oncomodulin (OM) (Parvalbumin beta) [OCM]   e-17
Q   PRV1_SALSA   Parvalbumin beta 1 (Major allergen Sal s 1)   e-16
P   PRVB_MERME   Parvalbumin beta [Merluccius merluccius (Eu   e-15
P   PRVB_GADCA   Parvalbumin beta (Allergen Gad c 1) (Gad c   e-13

Multiple Sequence Alignment Software
ClustalW (Unix, Mac, PC, VMS).
ClustalX (IGBMC, EBI) (graphical interface) (Unix, Mac, PC, VMS).
Multalin MSA (Unix).
DIALIGN (Unix).
DCA (Unix).
Multiple alignment by randomized iterative strategy (Unix).
MACAW (Mac, PC).
T-Coffee (Unix).
MAFFT (Linux, Unix, Windows XP, Mac OS X).

Multiple Sequence Alignment Online Tools
ClustalW at EBI (Hinxton, UK). Display and edit alignments with JalView.
ClustalW, Multalin at PBIL (Lyon, France). Colored alignments and secondary structure predictions.
ClustalW, MAP, PIMA at BCM
MSA, ClustalW, ctree at IBC (St Louis, USA)
Multalin at INRA (Toulouse, France). Colored alignments.
ClustalW, DCA, DIALIGN2 at Pasteur (Paris, France)
ClustalW at EMBL (Heidelberg, Germany). Performs multiple alignment on homologous sequences detected by BLAST.
ClustalW at DDBJ (Mishima, Japan)
MAP (Michigan Tech. Univ., USA)
ProbModel at CBRG (Zurich, Switzerland)
DIALIGN2 at BiBiServ (Bielefeld, Germany)
DCA at BiBiServ (Bielefeld, Germany)
ITERALIGN (Stanford, USA)
T-COFFEE (Lausanne, Switzerland)
MATCH-BOX (Namur, Belgium)
BLOCK Maker at FHCRC (Washington, USA)
MEME at SDSC (San Diego, USA)
MEME at Pasteur (Paris, France)
PIMA II at BMERC (Boston, USA)
MAVID at UCB (Berkeley, USA)

Multiple Sequence Alignment Software: ClustalW
First MSA program that could run on almost any platform.
Most widely used MSA program.
ClustalW is the latest version.
There are many Clustal servers around the world. Most run the same version, but their different interfaces provide access to different options.
It is also available as a stand-alone package.

Multiple Sequence Alignment Software: ClustalW
ClustalW uses a progressive method to build its alignments.
It compares two sequences at a time and clusters them by similarity.
This clustering resembles a phylogenetic tree and is called a dendrogram (the .dnd file in the ClustalW output).
In the example tree, A and B are more similar to each other than C and D are.
To make the progressive alignment, ClustalW follows the dendrogram: it first aligns A with B, and then C with D.
It then treats the resulting alignments like single sequences and aligns them two by two.
(Figure: dendrogram with leaves A, B, C and D joined at Root.)
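The clustering step can be sketched in a few lines. Note the swap: ClustalW builds its guide tree with neighbor joining, whereas this example uses plain average-linkage clustering because it is shorter to write down. The similarity scores are invented, and the function names are hypothetical.

```python
from itertools import combinations

def guide_tree(names, similarity):
    """Repeatedly join the two most similar clusters (average linkage).

    `similarity` maps frozenset({x, y}) to a pairwise score.
    Returns a nested tuple that records the join order.
    """
    clusters = [(name, [name]) for name in names]  # (subtree, member names)

    def avg_sim(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(similarity[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: avg_sim(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

def to_newick(tree):
    """Write the nested tuple as a parenthesised tree (no branch lengths)."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(t) for t in tree) + ")"

# Invented pairwise scores: A/B and C/D are the closest pairs.
scores = {frozenset(p): s for p, s in [
    (("A", "B"), 90), (("C", "D"), 70), (("A", "C"), 40),
    (("A", "D"), 35), (("B", "C"), 38), (("B", "D"), 36)]}
tree = guide_tree(["A", "B", "C", "D"], scores)
print(to_newick(tree) + ";")  # ((A,B),(C,D));
```

The join order is the order a progressive aligner would follow: align A with B, align C with D, then align the two sub-alignments to each other.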

Multiple Sequence Alignment Software: ClustalW
Pairwise Scores
These are the pairwise comparisons ClustalW uses to build its tree.
They can be ignored.

Shows the alignment.
Can be saved as a text file.
Can be viewed in color.

The Guide Tree
Shows the tree that ClustalW uses to guide its progressive alignment.
It is displayed in Phylip tree format.
A cladogram is a branching diagram (tree) assumed to be an estimate of a phylogeny in which the branches are of equal length. Cladograms therefore show common ancestry, but do not indicate the amount of evolutionary "time" separating taxa.

The Phylogram Tree
A phylogram is a branching diagram (tree) assumed to be an estimate of a phylogeny in which branch lengths are proportional to the amount of inferred evolutionary change.
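The difference between the two tree views can be shown on a parenthesised (Newick-style) tree string of the kind stored in a .dnd file. The tree string below is invented, and the regular expressions are a simplification that assumes unquoted names and plain numeric branch lengths.

```python
import re

# Hypothetical tree with branch lengths (phylogram information).
phylogram = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.05);"

def strip_branch_lengths(newick):
    """Drop every ':<number>' so only the branching order (cladogram) remains."""
    return re.sub(r":[0-9.eE+-]+", "", newick)

def leaf_branch_lengths(newick):
    """Map each leaf name to the length of its terminal branch."""
    return {name: float(length)
            for name, length in re.findall(r"(\w+):([0-9.]+)", newick)}

print(strip_branch_lengths(phylogram))  # ((A,B),(C,D));
print(leaf_branch_lengths(phylogram))
```

The stripped string carries only common ancestry; the branch lengths are what a phylogram adds on top of it.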

Interpreting Multiple Sequence Alignments
Interpreting an alignment is more an art than a science.
Unlike in database searching, no E values exist to tell us how reliable the result is.
The best method of evaluation is based on knowledge of protein structures.
Structures contain loops that evolve rapidly. Loops are the softer portions of the protein that connect its more rigid portions.
Protein structures also contain core regions inside the protein that act as its support walls. These support walls evolve less rapidly than the loops on the surface.
In a good multiple alignment you can expect to find gap-free blocks that correspond to core regions and gap-rich regions that correspond to the loops.

Interpreting Multiple Sequence Alignments
How can you tell whether a block is good? Take a look at the alignment symbols:
* A star indicates an entirely conserved column.
: A colon indicates a column where all the residues have roughly the same size and the same hydropathy.
. A period indicates a column where the size or the hydropathy has been preserved in the course of evolution.
An average good block is at least aa long, exhibiting at least one to three stars, five to seven colons and a few periods.
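The symbol line under an alignment can be approximated as below. The residue groups are the "strong" and "weak" groups as commonly listed in the Clustal documentation; treat them as an assumption and check them against the version you use. The function names and the toy rows are hypothetical.

```python
# Residue groups as commonly documented for ClustalW/ClustalX (assumed here).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
        "NDEQHK", "NEQHRK", "FVLIM", "HFY"]

def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column.upper())
    if "-" in residues:
        return " "                      # columns with gaps get no symbol
    if len(residues) == 1:
        return "*"                      # entirely conserved
    if any(residues <= set(group) for group in STRONG):
        return ":"                      # strongly similar residues
    if any(residues <= set(group) for group in WEAK):
        return "."                      # weakly similar residues
    return " "

def symbol_line(rows):
    """Build the symbol line for a list of aligned rows."""
    return "".join(column_symbol("".join(col)) for col in zip(*rows))

rows = ["MKSAW", "MRTVW", "MKAT-"]
print(repr(symbol_line(rows)))  # '*::. '
```

Counting the stars, colons and periods within a gap-free stretch then gives a quick, rough impression of how good a block is.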

Multiple Sequence Alignment Tools
BLAST servers with integrated MSAs:
Extract entire sequences.
Export sequences in FASTA format.
Submit sequences to ClustalW.
Submit sequences to T-Coffee.
Extract sequence fragments.
srs.ebi.ac.uk

http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_blast

srs.ebi.ac.uk
