
9/20/2018 Marine Biological Laboratory, Woods Hole, MA Workshop on Molecular Evolution: multiple sequence analysis session July 28, 2009, 7 to 10 PM.


1 9/20/2018 Marine Biological Laboratory, Woods Hole, MA Workshop on Molecular Evolution: multiple sequence analysis session July 28, 2009, 7 to 10 PM

2 Multiple Sequence Alignment & Analysis with SeaView and MAFFT
9/20/2018 Multiple Sequence Alignment & Analysis with SeaView and MAFFT Steven M. Thompson More data yields stronger analyses — if done carefully! The patterns of conservation become ever clearer by comparing the conserved portions of sequences amongst a larger and larger dataset. Mosaic ideas and evolutionary ‘importance.’ The power and sensitivity of sequence based computational methods dramatically increases with the addition of more data. More data yields stronger analyses — if done carefully! Otherwise, it can confound the issue. The patterns of conservation become clearer by comparing the conserved portions of sequences amongst a larger and larger dataset. Those areas most resistant to change are functionally the most important to the molecule. The basic assumption is that those portions of sequence of crucial functional value are most constrained against evolutionary change. They will not tolerate many mutations. Not that mutations do not occur in these portions, just that most mutations in the region are lethal so we never see them. Other areas of sequence are able to drift more readily being less subject to evolutionary pressure. Therefore, sequences end up a mosaic of quickly and slowly changing regions over evolutionary time.

3 But first a prelude: My definitions
9/20/2018 But first a prelude: My definitions Biocomputing and computational biology are synonymous and describe the use of computers and computational techniques to analyze any biological system, from molecules, through cells, tissues, organisms, and populations, to complete ecologies. Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available online biological databases. Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, mechanism, interactions, evolution, and perhaps structure of biological molecules. Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes. Proteomics is a subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

4 And a ‘way’ to think about it: The reverse biochemistry analogy
The reverse biochemistry analogy runs from a ‘virtual’ DNA sequence to actual molecular physical characterization, not the other way ‘round. Using bioinformatics tools, you can infer all sorts of functional, evolutionary, and structural insights into a gene product without the need to isolate and purify massive amounts of protein! Eventually you can go on to clone and express the gene based on that analysis using PCR techniques. The computer and molecular databases are an essential part of this process.

5 The exponential growth of molecular sequence databases
The exponential growth of molecular sequence databases & CPU power (Steve Thompson). [Figure: base pairs and sequences plotted by year.] Doubling time: about a year and a half!

6 Now then, why even bother — Applicability?
Applicability? Molecular evolutionary analysis; probe/primer and motif/profile design; graphical illustrations; and comparative ‘homology’ inference. OK, here are some examples.
So what’s so great about multiple sequence alignments; why would anyone want to bother? They are: very useful in the development of primers and probes as well as in motif discovery; great for producing annotated, publication-quality graphics and illustrations; invaluable in structure/function studies through homology inference; and required for molecular evolutionary phylogenetic inference programs.
Alignments help with primer design and motif discovery by allowing you to visualize the most conserved regions. Any level of specificity can be achieved by picking areas of high variability in the overall dataset that correspond to areas of high conservation in subset datasets, which differentiates between universal and specific probe sequences.
Graphics prepared from alignments can dramatically illustrate functional and structural conservation. These can take many forms and cover all or portions of an alignment: shaded or colored boxes or letters for each residue, cartoon representations of features, running line graphs of overall similarity, overlays of attributes, various consensus representations. All can be printed with high-resolution equipment, usually in color or gray tones.
Conserved regions of an alignment are functionally important. Structure is also conserved in these crucial regions. In fact, recognizable structural conservation between true homologues extends way beyond statistically significant sequence similarity. An oft-cited example is in the serine protease superfamily. S. griseus protease A demonstrates remarkably little similarity when compared to the rest of the superfamily (Expectation values E()101.8 in a typical search), yet its three-dimensional structure clearly shows its allegiance to the serine proteases (Pearson, W.R., personal communication). These principles are the premise of ‘homology modeling.’
Alignments are used to infer phylogeny. Based on the assertion of homologous positions, programs such as those in PAUP* (Phylogenetic Analysis Using Parsimony [and other methods]) and PHYLIP (PHYLogeny Inference Package) estimate the most reasonable evolutionary tree for that alignment. This is a huge, complicated, and highly contentious field. (See the Woods Hole Marine Biological Laboratory’s excellent summer course, the Workshop on Molecular Evolution.) However, always remember that regardless of the algorithm used (parsimony, any distance method, maximum likelihood, or even Bayesian inference), all molecular sequence phylogenetic inference programs make the absolute validity of your input alignment their first and most critical assumption.

7 Molecular evolution and phylogenetics
We all know multiple sequence alignments are necessary for phylogenetic inference, but does everybody here truly realize that the absolute positional homology of every column in a data matrix passed on to these programs is the most critical assumption that all the algorithms make (but see Bayesian coestimation)?

8 And what about this other stuff?
Multiple sequence alignments can be indispensable for primer design when you don’t have data on a particular taxon, yet data are available in related taxa. The conservation and variability within an alignment can help guide the design of universal or taxon-specific primers.

9 Here’s an HPV L1 example. The ellipses show areas where PCR primers could differentiate the Type 16 clade from its closest relatives: areas of high L1 conservation in the Type 16 clade (red line) that correspond to areas of much weaker conservation in the others (blue line).

10 Motif and profile definition
9/20/2018 Motif and profile definition An alignment of human SRY/SOX proteins illustrates the conservation of the HMG box. Conserved regions can be visualized with a sliding window approach and appear as peaks. Motifs and (better yet) HMM profiles can be created of the region to be used as a search tool to find other HMG box proteins. HMG box
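The sliding-window idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the conservation measure (fraction of rows sharing the most common residue), the window width, and the toy sequences are all made up here and are not the method behind the slide's plot.

```python
from collections import Counter

def column_conservation(alignment):
    """Per column: fraction of rows that carry the most common non-gap residue."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

def sliding_window(scores, width):
    """Mean score of every full window of the given width."""
    return [sum(scores[i:i + width]) / width
            for i in range(len(scores) - width + 1)]

# Hypothetical toy alignment; a conserved core flanked by variable ends.
toy = ["MKRPMNAFMV", "MARPMNAFIV", "LQRPMNAFLL"]
profile = sliding_window(column_conservation(toy), width=4)
peak_start = max(range(len(profile)), key=profile.__getitem__)
```

Plotting `profile` against window position would give the kind of curve where conserved regions show up as peaks; `peak_start` is the first window with the highest mean.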

11 9/20/2018 One picture’s worth . . . The HMG-box domain is strikingly conserved amongst the otherwise nearly unalignable human DNA regulatory paralogous protein family.

12 Structure/function homology inference
9/20/2018 Structure/function homology inference A Swiss-Model homology based model of Giardia EF1 superimposed over its eight most similar sequences with solved structure. Amazingly accurate inferences of both function and structure are possible using comparative methods.

13 On to aligning multiple sequences: dynamic programming’s complexity increases exponentially with the number of sequences being compared. N-dimensional matrix complexity: O([sequence length]^[number of sequences]).
As seen in pairwise dynamic programming, looking at every possible position by sliding one sequence along every other sequence just will not work for alignment. Dynamic programming reduces the pairwise problem back down to the square of the sequence length. But how do you work with more than just two sequences at a time? It becomes a much harder problem. You could painstakingly align all your sequences manually using some type of editor, and many people do just that, but some type of automated solution is desirable, at least as a starting point for manual alignment. However, solving the dynamic programming algorithm for more than just two sequences rapidly becomes intractable. Dynamic programming’s complexity, and hence its computational requirements, increases exponentially with the number of sequences in the dataset being compared (complexity = [sequence length]^[number of sequences]). Mathematically this is an N-dimensional matrix, quite complex indeed. Pairwise dynamic programming solves a two-dimensional matrix, and the complexity of the solution is equal to the length of the longest sequence squared. A three-sequence dynamic programming comparison would be a matrix with three axes, with complexity equal to the length of the longest sequence cubed, and so forth. You can at least draw a three-dimensional matrix, but more than that becomes impossible to even visualize. It quickly boggles the mind!
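To make the growth concrete, here is a minimal Python sketch of the pairwise (two-dimensional) case together with a cell count for the N-dimensional case. The scoring values (match +1, mismatch -1, linear gap -2) are arbitrary assumptions chosen for illustration, not parameters from any program named in these slides.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score; fills a (len(a)+1) x (len(b)+1) matrix row by row."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def dp_cells(length, n_sequences):
    """Number of matrix cells for full dynamic programming over n sequences."""
    return length ** n_sequences

pair = dp_cells(300, 2)    # two 300-residue sequences
triple = dp_cells(300, 3)  # three of them
```

Two sequences of 300 residues need 90,000 cells; adding a third raises that to 27 million, and every further sequence multiplies the count by 300 again.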

14 A couple ‘global’ solutions using heuristic tricks
MSA (‘global’ within a ‘bounding box’) and PIMA (‘local’ portions only). Both are available on the multiple alignment page at the Baylor College of Medicine’s Search Launcher, but with severely limiting restrictions!
Multiple Sequence Dynamic Programming. Several different heuristics have been employed over the years to simplify the complexity of the problem. One program, MSA (Gupta et al. [version 2.0, 1995] and version 2.1), does attempt to globally solve the N-dimensional matrix equation using a bounding box trick. However, the algorithm’s complexity precludes its use in most situations, except with very small datasets. One way to still globally solve the algorithm and yet reduce its complexity is to restrict the search space to only the most conserved ‘local’ portions of all the sequences involved. This approach is used by the program PIMA (Smith and Smith, version 1.4, 1995). MSA and PIMA are both available through the Internet at several bioinformatics servers (in particular the Baylor College of Medicine’s Search Launcher).

15 Therefore — pairwise, progressive dynamic programming . . .
Pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time. All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners represented as a consensus. Each group of partners is then aligned to finish the complete multiple sequence alignment.
How the Algorithm Works. The most common implementations of automated multiple alignment modify dynamic programming by establishing a pairwise order in which to build the alignment. This modification is known as pairwise, progressive dynamic programming. Originally attributed to Feng and Doolittle (1987), this variation of the dynamic programming algorithm generates a global alignment, but restricts its search space at any one time to a local neighborhood of the full length of only two sequences. Consider a group of sequences. First, all are compared to each other, pairwise, using normal dynamic programming. This establishes an order for the set, most to least similar. Subgroups are clustered together similarly. Then take the two most similar sequences and align them using normal dynamic programming. Now create a consensus of the two and align that consensus to the third sequence using standard dynamic programming. Now create a consensus of the first three sequences and align that to the fourth most similar. This process continues until it has worked its way through all sequences and/or sets of clusters.
The pairwise, progressive solution is implemented in several programs. Perhaps the most popular is Higgins’ and Thompson’s ClustalW (1994) and its multi-platform graphical user interface ClustalX (Thompson, et al., 1997). ClustalX has versions available for most windowing operating systems: most UNIX flavors, Microsoft Windows, and Macintosh. The ClustalX homesite guarantees the latest version: ftp://ftp-igbmc.u-strasbg.fr/ClustalX/. The GCG program PileUp implements a very similar method within the Wisconsin Package.
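The ordering step can be sketched as follows. This is a simplified illustration, not ClustalW's or PileUp's procedure: Python's `difflib.SequenceMatcher` ratio stands in for a real pairwise dynamic programming score, and the greedy joining rule, the function names, and the toy sequences are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    # Stand-in for a pairwise dynamic programming score (an assumption).
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def progressive_order(seqs):
    """Order in which a simple progressive scheme would add sequences."""
    sim = {frozenset(pair): similarity(seqs[pair[0]], seqs[pair[1]])
           for pair in combinations(seqs, 2)}
    # Start with the most similar pair.
    order = sorted(max(sim, key=sim.get))
    remaining = set(seqs) - set(order)
    while remaining:
        # Next comes the sequence most similar to anything already placed.
        nxt = max(sorted(remaining),
                  key=lambda r: max(sim[frozenset((r, d))] for d in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

toy = {"a": "ACGTACGTAC", "b": "ACGTACGTAA",
       "c": "ACGTTTTTAC", "d": "GGGGGGGGGG"}
order = progressive_order(toy)
```

A real implementation would align each newcomer to a consensus or profile of the group at every step; only the ordering is shown here.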

16 Enhancements on the theme
9/20/2018 This was pretty much the original ClustalV and GCG’s PileUp program then . . . Enhancements on the theme First enhancements came from ClustalW — variable sequence weighting, dynamically varying gap penalties and substitution matrices, and a neighbor-joining guide-tree. Since the year 2000 a slew of new programs have tried other heuristic variations, all in attempts to build faster, more accurate multiple sequence alignments. The devil’s in the details: Muscle, ProbCons, T-Coffee, MAFFT and many, many more.

17 Muscle: An iterative method that uses weighted log-expectation profile scoring along with a slew of optimizations. It proceeds in three stages: draft progressive using k-mer counting, improved progressive using a revised tree from the previous iteration, and refinement by sequential deletion of each tree edge with subsequent profile realignment.
ProbCons: Uses Hidden Markov Model (HMM) techniques and posterior probability matrices that compare random pairwise alignments to expected pairwise alignments. A probabilistic consistency transformation is used to reestimate the scores, and a guide tree is then constructed, which is used to compute the alignment, which is then iteratively refined. Incredibly accurate.

18 9/20/2018 T-Coffee Uses a preprocessed, weighted library of all the pairwise global alignments between your sequences, plus the ten best local alignments associated with each pair. This helps build the NJ guide-tree and the progressive alignment. The library is used to assure consistency and help prevent errors, by allowing ‘forward-thinking’ to see whether the overall alignment will be better one way or another after particular segments are aligned one way or another. The institutional schedule analogy T-Coffee can even tie together multiple methods as external modules, making consistency libraries from the results of each, as long as all the specified methods are installed on your system. T-Coffee is one of the most accurate multiple sequence alignment methods available because of this consistency based rationale, but it is not the fastest. Regardless, I encourage you to check it out!

19 MAFFT — today’s example
9/20/2018 MAFFT — today’s example — has many modes, among them: a couple of progressive, approximate modes, using a fast Fourier transformation (FFT); a couple of iteratively refined methods that add in weighted-sum-of-pairs (WSP) scoring; and several iterative methods that use WSP scoring combined with a T-Coffee-like consistency based scoring scheme. Speed and accuracy are inversely proportional for these from fast and rough, to slow and accurate, respectively. MAFFT provides command aliases for all of these, from fast to slow — FFTNS with or without retree, FFTNSI with or without maxiterate, and the three combined approaches EINSI, LINSI, and GINSI.

20 MAFFT’s basic algorithm
MAFFT’s fast Fourier transform provides a huge speedup over previous methods. Homologous regions are quickly identified by converting amino acid residues to vectors of volume and polarity, thus changing a twenty-character alphabet to six, rather than by using an amino acid similarity matrix. Similarly, nucleotide bases are converted to vectors of imaginary and complex numbers. The FFT trick then reduces the complexity of the subsequent comparison to O(N log N). FFT identifies potential similarities, though, without localizing them; a sliding window step using the BLOSUM62 matrix is used for this. Then MAFFT constructs a distance matrix, and hence a progressive guide tree, based on the number of shared six-tuples from this Fourier transform, rather than on a ranking based on full-length, pairwise sequence similarity. The user can specify how many times a new guide tree is subsequently recalculated from a previous alignment; the alignment is reconstructed using the Needleman-Wunsch algorithm each time.
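Counting shared six-tuples over a reduced alphabet can be sketched like this. The six-class grouping below is a Dayhoff-style table chosen for illustration and is an assumption; it is not claimed to be the table MAFFT uses, and the single-letter class codes and function names are hypothetical.

```python
from collections import Counter

# Assumed six-class grouping (Dayhoff-style); not necessarily MAFFT's own table.
GROUPS = {"AGPST": "a", "C": "c", "DENQ": "d",
          "FWY": "f", "HKR": "h", "ILMV": "i"}
RECODE = {aa: code for members, code in GROUPS.items() for aa in members}

def compress(seq):
    """Rewrite a protein sequence in the six-letter alphabet ('x' = unknown)."""
    return "".join(RECODE.get(aa, "x") for aa in seq.upper())

def kmers(seq, k=6):
    """Multiset of k-tuples in the compressed sequence."""
    s = compress(seq)
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def shared_kmers(a, b, k=6):
    """Number of k-tuples the two sequences have in common."""
    return sum((kmers(a, k) & kmers(b, k)).values())

# Hypothetical peptides that differ only by within-class substitutions.
n_shared = shared_kmers("MKVLAAGIVG", "MRILSSGLIG")
```

Because the two toy peptides compress to the same string, every one of their five six-tuples is shared, even though half of the residues differ. That tolerance for conservative substitutions is the point of the reduced alphabet.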

21 Some of MAFFT’s many modes
And each mode has a bunch of additional options!
1) Most basic, fastest modes: just progressive.
a) FFTNS1 (fftns --retree 1)
b) FFTNS2 (fftns; same as mafft --retree 2)
Suitable for 1,000’s of easily aligned sequences. A rough distance matrix is built from the sequences using FFT and the shared number of six-mers. A modified UPGMA guide tree is built from this matrix. The sequences are aligned according to the rough, initial guide tree (as in ‘traditional’ methods). FFTNS2 adds a recomputation of the guide tree (retree 2) from the original alignment, from which a new progressive alignment is built.

22 MAFFT’s iterative refinements
2) Intermediate modes: progressive + iterations to maximize the WSP objective function.
a) FFTNSI (fftnsi): default two cycles, or e.g. fftnsi --maxiterate 1000
b) NWNSI (nwnsi): same as FFTNSI, but no FFT, Needleman-Wunsch only.
Progressive alignment and retree as before, with or without FFT, and then iterative refinement is cycled twice (default), or repeatedly until there is no further improvement, or until you reach your specified limit. Suitable for 100’s through 1,000’s of sequences.

23 MAFFT’s most accurate modes
3) Advanced modes: progressive + iterations to maximize the objective WSP and T-Coffee-like consistency functions. Options differ according to the way the pairwise alignments are calculated.
a) EINSI (einsi): most general of these. Uses a Smith-Waterman style local algorithm with generalized affine gap costs for the pairwise step. Most appropriate for sequences with multiple shared, similarly ordered domains in an otherwise nearly unalignable ‘mess,’ e.g.:
ooooooXXX------XXXX XXXXXXXXXXX-XXXXXXXXXXXXXXXoooooooooo
------XXXXXXXXXXXXXooo XXXXXXXXXXXXXXXXXX-XXXXXXXX
--ooooXXXXXX---XXXXooooooooooo XXXXX----XXXXXXXXXXXXXXXXXXoooooooooo
------XXXXX----XXXXoooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX
------XXXXX----XXXX XXXXX---XXXXXXXXXX--XXXXXXXooooo-----

24 MAFFT’s most accurate modes, cont.
3) Advanced modes: progressive + iterations to maximize the objective WSP and T-Coffee-like consistency functions. Options differ according to the way the pairwise alignments are calculated.
b) LINSI (linsi): strictly local. Uses a Smith-Waterman style local algorithm with affine gap costs for the pairwise step. Most appropriate for sequences with only one single, shared domain in an otherwise nearly unalignable ‘mess,’ e.g.:
XXXXXXXXXXX-XXXXXXXXXXXXXXXoooooooooo
XXXXXXXXXXXXXXXXXX-XXXXXXXX
XXXXX----XXXXXXXXXXXXXXXXXXoooooooooo
ooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX
XXXXX---XXXXXXXXXX--XXXXXXXooooo-----

25 MAFFT’s most accurate modes, cont.
3) Advanced modes: progressive + iterations to maximize the objective WSP and T-Coffee-like consistency functions. Options differ according to the way the pairwise alignments are calculated.
c) GINSI (ginsi): strictly global. Uses a Needleman-Wunsch style global algorithm with affine gap costs for the pairwise step. Most appropriate for sequences where only one single, shared domain extends the full length of all of the sequences, e.g.:
XXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXooooXXXooXXX
-XXXXXXXXXXXXXXXXXX-XXXXXXXX--XXXXXXX---XXX
XX--XXXXX---XXXXXXXXXXXXXXXXXXXoooooXXoooXX
ooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXXX-
XXXXX---XXXXXXXXXX--XXXXXXXoooooXXXXXXXXX--
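For scripting, the modes above can be mapped onto `mafft` command lines. The sketch below only builds argument lists; the flag spellings (`--retree`, `--maxiterate`, `--localpair`, `--globalpair`, `--genafpair`) are given as I understand them from the MAFFT manual, so treat them as assumptions and check `mafft --help` on your installation. The dictionary and function names are hypothetical.

```python
# Assumed mapping from the mode aliases on these slides to mafft flags.
MODE_FLAGS = {
    "fftns1": ["--retree", "1"],
    "fftns2": ["--retree", "2"],
    "fftnsi": ["--retree", "2", "--maxiterate", "2"],
    "einsi": ["--genafpair", "--maxiterate", "1000"],
    "linsi": ["--localpair", "--maxiterate", "1000"],
    "ginsi": ["--globalpair", "--maxiterate", "1000"],
}

def mafft_command(mode, infile):
    """Argument list for one of the modes; raises KeyError on an unknown mode."""
    return ["mafft", *MODE_FLAGS[mode], infile]

cmd = mafft_command("linsi", "input.fasta")  # hypothetical file name
```

MAFFT writes the alignment to standard output, so a call such as `subprocess.run(cmd, stdout=handle, check=True)` with an open output file would capture it.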

26 How to know when to use what
9/20/2018 How to know when to use what for MAFFT — see “tips,” 2,3, and 4 pages, for all of them — Take home message: For simple cases it doesn’t really matter what program to use. For complicated situations it may, and what you use will depend on the size of your dataset, personal preferences, time allotted, and how much hand editing you want to do. Really nice, fairly recent review: Edgar, R.C. and Batzoglou, S. (2006) Multiple sequence alignment. Current Opinion in Structural Biology 16, 368–373. The rest of my references can be found in my tutorial manuscript.

27 You can do a lot of this stuff on the Web, if you need to. Some resources for multiple sequence alignment follow. However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!
Biocomputing sites around the globe on the World Wide Web (WWW) provide access to multiple alignment resources. In general, Web resources for multiple alignment are neither as easy to use nor as powerful as performing multiple alignment locally, on either your own office machine or a local dedicated sequence analysis server. Some of the difficulty comes from limits in Web interface scripting and forms capabilities and from cut-and-paste errors, but some also comes from the unreliability of Internet connections in general. In spite of that warning, it is possible, and relatively easy, to take advantage of multiple sequence resources available on the Internet through the WWW. However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll recognize that size very quickly when you’ve reached it!
One of the most comprehensive collections is at the Bielefeld University Virtual School of Natural Sciences BioComputing Division (VSNS-BCD) in Germany. Another very good one is at the PBIL (Pôle Bio-Informatique Lyonnais) World Wide Web server in Lyon, France, and the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL EBI) in Hinxton, U.K. has a slick interface to ClustalW. In the U.S.A. the previously mentioned Baylor College of Medicine Search Launcher is also available.

28 Therefore, I argue for UNIX server-based solutions . . .
9/20/2018 If large datasets become intractable for analysis on the Web, what other resources are available? Desktop software solutions — all of these programs are available in public domain open source, but they can be complicated to install, configure, and maintain. User must be pretty computer savvy. So, commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc., but license hassles, big expense per machine, lack of most recent programs, underperformance, and Internet and/or CD database access all complicate matters! Therefore, I argue for UNIX server-based solutions . . .

29 UNIX servers — pros and cons
9/20/2018 UNIX servers — pros and cons Free/public domain solutions still available, but now a very cooperative systems manager needs to maintain everything for users. If you have such a person, then: You end up with a more powerful, and usually faster computer, with larger storage capabilities. Plus, connections can be made from any networked terminal or workstation anywhere! Operating system: UNIX command line operation hassles; communications software — telnet, ssh, and terminal emulation; X graphics; file transfer — ftp, and scp/sftp; and editors — vi, emacs, pico/nano (or desktop word processing followed by file transfer [save as "text only!"]). See my supplement pdf file.

30 Reliability and the Comparative Approach —
Explicit homologous correspondence; manual adjustments should be encouraged, based on knowledge, especially of structural, regulatory, and functional sites. Therefore, use editors like SeaView and databases like the Ribosomal Database Project.
Reliability? To help assure the reliability of sequence alignments, always use comparative approaches. A multiple sequence alignment is a hypothesis of evolutionary history. Ensure that you have prepared a good one; be sure that it makes sense. Think about it: a sequence alignment is a statement of positional homology. It establishes the explicit homologous correspondence of each individual sequence position, each column in the alignment. Therefore, devote considerable time and energy toward developing the most satisfying multiple sequence alignment possible. This includes adjusting alignments manually based on your knowledge of the biological system being studied. Researchers have successfully used the conservation of covarying sites in ribosomal and other structural RNA alignments to assist in alignment refinement. That is, as one base in a stem structure changes, the corresponding Watson-Crick paired base will change in a corresponding manner. This process has been used extensively by the Ribosomal Database Project at the Center for Microbial Ecology at Michigan State University to help guide the construction of their rRNA alignments and structures. The WWW Uniform Resource Locator (URL) is

31 Coding DNA issues Work with proteins! If at all possible.
9/20/2018 Coding DNA issues Work with proteins! If at all possible. Twenty match symbols versus four, plus similarity versus identity! Way better signal to noise. Also guarantees no indels are placed within codons. So translate, then align. Nucleotide sequences will only reliably align if they are very similar to each other. And they will likely require extensive and carefully considered hand editing with an editor like SeaView.
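One way to get a codon-respecting nucleotide alignment from ‘translate, then align’ is to thread each coding sequence back onto its aligned protein row. The sketch below assumes the coding sequence starts in frame at the first residue and does not verify that the codons actually translate to the protein; the function name and toy data are hypothetical.

```python
def thread_codons(aligned_protein, cds):
    """Map an aligned protein row back onto its coding DNA, codon by codon."""
    ungapped = aligned_protein.replace("-", "")
    if len(cds) < 3 * len(ungapped):
        raise ValueError("coding sequence is too short for the protein")
    out, pos = [], 0
    for symbol in aligned_protein:
        if symbol == "-":
            out.append("---")          # a protein gap becomes a whole-codon gap
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)

# Hypothetical example: protein row M-KV over the coding sequence ATG AAA GTT.
dna_row = thread_codons("M-KV", "ATGAAAGTT")
```

Every gap in the result is a multiple of three bases long and falls on a codon boundary, which is the guarantee the slide refers to.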

32 Beware of aligning apples and oranges [and grapefruit]!
Receptors and/or activators with their namesake proteins; paralogous versus orthologous; genomic versus cDNA; mature versus precursor.
Be sure an alignment makes biological sense: align things that make sense to align! Beware of comparing ‘apples and oranges.’ If creating alignments for phylogenetic inference, either make paralogous comparisons (i.e. evolution via gene duplication) to ascertain gene phylogenies within one organism, or orthologous comparisons (within one ancestral locus) to ascertain gene phylogenies between organisms, which should imply organismal phylogenies. Try not to mix them up without complete data representation. Lots of confusion can arise, especially if you do not have all the data and/or if the nomenclature is contradictory; extremely misleading interpretations can result. Be wary of trying to align genomic sequences with cDNA when working with DNA; the introns will cause all sorts of headaches. Similarly, do not align mature and precursor proteins from the same organism and locus. It does not make evolutionary sense, as one is not evolved from the other; rather, one is the other. These are all easy mistakes to make; try your best to avoid them.

33 Mask out uncertain areas —
I reiterate, the most important factor in inferring reliable phylogenies is the accuracy of the multiple sequence alignment. The interpretation of your results is utterly dependent on the quality of your input. In fact, many experts advise against using any parts of the sequence data that are at all questionable. Only analyze those portions that assuredly do align. If any portions of the alignment are in doubt, throw them out. This usually means trimming down or masking out the alignment’s terminal ends and may require internal trimming or masking as well. Biocomputing is always a delicate balance of signal against noise, and sometimes it can be quite the balancing act! Remember the old adage “garbage in, garbage out!” Some general guidelines to remember include the following: If the homology of a region is in doubt, then throw it out (or “mask” it, as can be done using SeqLab). Avoid the most diverged parts of molecules; they are the greatest source of systematic error. Do not include sequences that are more diverged than necessary for the analysis at hand.
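A crude, automatable first pass at masking is to drop columns that are mostly gaps. This is only a sketch: the gap-fraction criterion and the 0.5 threshold are assumptions, and gap content is a weak proxy for doubtful homology, so it does not replace inspecting the alignment by eye.

```python
def keep_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns whose gap fraction does not exceed the threshold."""
    keep = []
    for i, column in enumerate(zip(*alignment)):
        if column.count("-") / len(column) <= max_gap_fraction:
            keep.append(i)
    return keep

def apply_mask(alignment, keep):
    """Return the alignment restricted to the kept columns."""
    return ["".join(row[i] for i in keep) for row in alignment]

# Hypothetical toy alignment with ragged ends and one sparse insertion.
toy = ["--MKV-L", "A-MKVQL", "--MRV-L"]
kept = keep_columns(toy)
masked = apply_mask(toy, kept)
```

Keeping the list of retained column indices, rather than only the trimmed rows, lets you report exactly which positions went into the phylogenetic analysis.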

34 Complications — Order dependence. Not that big of a deal.
Substitution matrices and gap penalties. Can be a very big deal! Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity. SeaView lets you do this!
Complications. One liability of global progressive, pairwise methods is that they are entirely dependent on the order in which the sequences are aligned. Fortunately, ordering them from most similar to least similar usually makes biological sense and works very well. However, the techniques are very sensitive to the substitution matrix and gap penalties specified. Programs such as ClustalW and PileUp that allow ‘fine-tuning’ areas of an alignment by realignment with different scoring matrices and/or gap penalties can be extremely helpful because of this. In PileUp this is achieved with the -InSitu option. This is particularly true in cases where your sequence dataset has areas of drastically different similarity, some very high and others quite low, e.g. transmembrane proteins. However, any automated multiple sequence alignment program should be thought of as only a tool to offer a starting alignment that can be improved upon, not the ‘end-all-to-meet-all’ solution, guaranteed to provide the ‘one-true’ answer.

35 Complications cont. — Format hassles!
Specialized format conversion tools such as GCG’s SeqConv+ program and PAUPSearch, and Don Gilbert’s public domain ReadSeq program. Plus, some programs like SeaView can read and write several formats.
One of the biggest problems in computational biology is that of molecular sequence data format. Each suite of programs to come along seems to require its own different sequence format. The major databases all have their own; Clustal has its own; even the database similarity searching program FastA has a sequence format associated with it. GCG Wisconsin Package sequence format exists both as single sequence format and as Multiple Sequence Format (MSF), and GCG’s SeqLab has its own format called Rich Sequence Format (RSF) that contains both sequence data and reference and feature annotation. PAUP* has a required format called the NEXUS file, and PHYLIP has its own unique input data format requirements. The PAUP* interfaces in the GCG Wisconsin Package, PAUPSearch and PAUPDisplay, automatically generate their required NEXUS format directly from the GCG formatted files. Most systems are not nearly so helpful. Several different programs are available to convert formats back and forth between the required standards, but it all can get quite confusing. One program available, ReadSeq by Don Gilbert at Indiana University (1990), allows for back-and-forth conversion between several different formats. I would heartily recommend installing it on all of your computers. It comes as an old ‘tried-and-true’ C version or a new Java version with a graphical interface. I don’t have much experience with the Java version but have relied on the C version for many years.

36 Still more complications —
Indel and missing-data symbol (i.e. gap) designation discrepancies are a headache: ., -, ~, ?, N, or X. Help! Alignment gaps are another problem. Different program suites may use different symbols to represent them. Most programs use hyphens, “-”; the Wisconsin Package uses periods, “.”. Furthermore, not all gaps in sequences should be interpreted as deletions. Interior gaps are probably okay to represent this way: regardless of whether a deletion, an insertion, or a duplication event created the gap, the algorithms will logically treat them the same. These are indels. However, end gaps should not be represented as indels, because a lack of information beyond the length of a given sequence may not be due to a deletion or insertion event. It may have nothing to do with the particular stretch being analyzed at all. It may just not have been sequenced! These gaps are just placeholders for the sequence. Therefore, it is safest to manually edit an alignment to change leading and trailing gap symbols to “x”, which means “unknown amino acid,” or “n”, which means “unknown base,” or “?”, which is supported by many programs, but not all, and means “unknown residue or indel.” This will ensure that the programs do not make incorrect assumptions about your sequences.
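The end-gap edit described above is easy to script if you would rather not do it by hand. This is a minimal sketch; the function name and its default arguments are my own assumptions, and you should pick the replacement symbol ("x", "n", or "?") that your downstream program actually supports.

```python
def mask_end_gaps(seq, gap_chars="-.~", unknown="?"):
    """Replace leading and trailing gap symbols with an 'unknown' symbol.

    Interior gaps are left untouched, since those can be treated as indels.
    """
    without_lead = seq.lstrip(gap_chars)
    lead = len(seq) - len(without_lead)        # number of leading gap symbols
    core = without_lead.rstrip(gap_chars)
    trail = len(without_lead) - len(core)      # number of trailing gap symbols
    return unknown * lead + core + unknown * trail


if __name__ == "__main__":
    print(mask_end_gaps("--AC-GT---"))               # hyphen-style gaps
    print(mask_end_gaps("..MK.LV.", unknown="x"))    # Wisconsin Package-style gaps
```

The alignment length never changes, so the edited sequences stay in register with the rest of the dataset.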

37 Conclusions
Gunnar von Heijne, in his very old but quite readable treatise, Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion: “Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.” He continues: “. . . if any lesson is to be drawn it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”
How can we use all the information contained in a multiple sequence alignment to explore Russell Doolittle’s “Twilight Zone,” i.e. those similarities below ~20% identity, those Z scores below ~5, those BLAST/FastA E values above ~10^-5 or so? Just because a similarity score between two sequences is quite low, we do not automatically know that the two structures do not fold in a similar manner and perform a similar function, nor do we know whether they are truly homologous! Obviously much of the information in a multiple sequence alignment is “noise” at this similarity level. A profile is a position-specific weight matrix of a multiple sequence alignment or a portion of an alignment, i.e. a weighted two-dimensional description. The more highly conserved a residue is, the more important it becomes. Furthermore, gap insertion is penalized more heavily in conserved areas of the alignment than it is in variable regions. Profiles were originally described by Gribskov et al. (1987); later refinements have added more statistical rigor (see e.g. Expectation Maximization [Bailey and Elkan, 1994] and Eddy’s Hidden Markov Model Profiles [1996 and 1998]). Generally, a profile is created from an alignment of related sequences or regions within sequences and then used to search databases for remote sequence similarities. Profile searching is tremendously powerful and can provide the most sensitive, albeit extremely computationally intensive, database similarity searches possible.
FOR MORE INFO... Explore my Web Home: Contact me for specific long-distance bioinformatics assistance and collaboration.
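For a feel of what a position-specific weight matrix looks like, here is a minimal sketch that turns a small DNA alignment into per-column log-odds scores and scores a candidate sequence against it. It is a simplification under stated assumptions: a uniform background, a flat pseudocount, gaps simply ignored, and no position-specific gap penalties. It is not the Gribskov et al. weighting scheme nor an HMM profile, and the function names are hypothetical.

```python
import math
from collections import Counter


def build_profile(alignment, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds position-specific scoring matrix from aligned sequences.

    Returns one {residue: score} dict per alignment column. Characters outside
    the alphabet (e.g. gaps) are ignored; the background is assumed uniform.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({
            residue: math.log2(((counts[residue] + pseudocount) / total) / background)
            for residue in alphabet
        })
    return profile


def score_sequence(profile, seq):
    """Sum the column scores for an ungapped sequence as long as the profile."""
    if len(seq) != len(profile):
        raise ValueError("sequence and profile lengths differ")
    return sum(column[residue] for column, residue in zip(profile, seq))


if __name__ == "__main__":
    alignment = ["ACGT", "ACGA", "ACGT", "TCGT"]
    profile = build_profile(alignment)
    for candidate in ("ACGT", "TTTT"):
        print(candidate, round(score_sequence(profile, candidate), 2))
```

Even in this toy version the key property shows up: a residue in a fully conserved column contributes more to the score than the same kind of match in a variable column.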

38 On to a demonstration of some of SeaView’s multiple sequence dataset capabilities, using the HPV L1 gene and complete genome. The tutorial: How to use SeaView with MAFFT.

