Arun Goja MITCON BIOPHARMA

Presentation on theme: "Arun Goja MITCON BIOPHARMA"— Presentation transcript:

1 Arun Goja MITCON BIOPHARMA
Sequence Alignment Arun Goja MITCON BIOPHARMA

2 Why do we want to compare sequences?
Evolutionary relationships: Phylogenetic trees can be constructed based on comparison of the sequences of a molecule (example: 16S rRNA) taken from different species. Residues conserved during evolution play an important role.
Prediction of protein structure and function: Proteins which are very similar in sequence generally have similar 3D structure and function as well. By searching a sequence of unknown structure against a database of known proteins, the structure and/or function can in many cases be predicted.

3 Why sequence alignment?
Sequence alignment is important for:
* prediction of function
* database searching
* gene finding
* sequence divergence
* sequence assembly

4 Over time, genes accumulate mutations
Environmental factors: radiation, oxidation
Mistakes in replication or repair
Deletions, duplications, insertions, inversions, point mutations

5 Deletions
Codon deletion:
ACG ATA GCG TAT GTA TAG CCG…
Effect depends on the protein, position, etc. Almost always deleterious. Sometimes lethal.
Frame shift mutation:
ACG ATA GCG TAT GTA TAG CCG…
ACG ATA GCG ATG TAT AGC CG?…
Almost always lethal.
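The difference between a codon deletion and a frame shift can be sketched in a few lines of Python. This is only an illustration: the helper names `codons` and `delete` are hypothetical, and the deletion position (index 9, the start of the TAT codon) is chosen to reproduce the frame-shift example on the slide.

```python
def codons(seq):
    """Split a nucleotide string into consecutive triplets (the last may be short)."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def delete(seq, start, length):
    """Return seq with `length` nucleotides removed, beginning at index `start`."""
    return seq[:start] + seq[start + length:]


gene = "ACGATAGCGTATGTATAGCCG"

# Codon deletion: removing a whole triplet keeps the downstream reading frame.
codon_deleted = codons(delete(gene, 9, 3))

# Frame shift: removing a single nucleotide changes every downstream codon.
frame_shifted = codons(delete(gene, 9, 1))

print(" ".join(codons(gene)))
print(" ".join(codon_deleted))
print(" ".join(frame_shifted))
```

After the single-nucleotide deletion every codon downstream of the deletion point is read differently, which is why frame shifts are usually far more damaging than whole-codon deletions.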

6 Indels
Comparing two genes, it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless ancestry is known:
ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT

7 Comparing two sequences
Point mutations, easy:
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
Indels are difficult, must align sequences:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT
After alignment:
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT

8 Causes for sequence (dis)similarity
mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)
insertion: at a certain location one new nucleotide is inserted in between two existing nucleotides (e.g.: AA → AGA)
deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)
indel: an insertion or a deletion

9 Definition
Homology: related by descent
Homologous sequence positions
[Figure: homologous positions in the related sequences ATTGCGC and ATCCGC]
ATTGCGC
AT-CCGC

10 Orthologous and paralogous
Orthologous sequences differ because they are found in different species (a speciation event).
Paralogous sequences differ due to a gene duplication event.
Sequences may be both orthologous and paralogous.

11 Sequence alignment - meaning
Sequence alignment is used to study the evolution of sequences, such as protein or DNA sequences, from a common ancestor. Mismatches in the alignment correspond to mutations, and gaps correspond to insertions or deletions. Sequence alignment also refers to the process of finding significant alignments in a database of potentially unrelated sequences.

12 Sequence alignment - definition
Sequence alignment is an arrangement of two or more sequences, highlighting their similarity. The sequences are padded with gaps (dashes) so that wherever possible, columns contain identical characters from the sequences involved.
tcctctgcctctgccatcat---caaccccaaagt
|||| ||| ||||| |||||   ||||||||||||
tcctgtgcatctgcaatcatgggcaaccccaaagt

13 Pairwise alignment: the problem
The number of possible pairwise alignments increases explosively with the length of the sequences: two protein sequences of length 100 amino acids can be aligned in approximately 10^60 different ways. The time needed to test all possibilities is of the same order of magnitude as the entire lifetime of the universe.
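To get a feel for how fast the number grows, here is a rough sketch. It assumes one common counting convention, the binomial coefficient C(2n, n), for two sequences of equal length n. The exact count depends on what is considered a distinct alignment, so the result should be read only as an order-of-magnitude estimate that lands in the same region as the figure quoted above. The function name `count_alignments` is hypothetical.

```python
import math


def count_alignments(n):
    """C(2n, n): an assumed, approximate count of alignments of two length-n sequences."""
    return math.comb(2 * n, n)


for n in (5, 10, 20, 50, 100):
    print(n, f"{count_alignments(n):.2e}")
```

Even under this conservative convention, length 100 already gives a number far too large to enumerate, which is the motivation for dynamic programming.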

14 Pairwise Alignment The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem. There are lots of possible alignments. Two sequences can always be aligned. Sequence alignments have to be scored. Often there is more than one solution with the same score.

15 Methods of Alignment
By hand - slide sequences on two lines of a word processor
Dot plot with windows
Rigorous mathematical approach: Dynamic programming (slow, optimal)
Heuristic methods (fast, approximate): BLAST and FASTA; word matching and hash tables

16 Align by Hand
GATCGCCTA_TTACGTCCTGGAC
<-- -->
AGGCATACGTA_GCCCTTTCGC
You still need some kind of scoring system to find the best alignment.

17 Percent Sequence Identity
The extent to which two nucleotide or amino acid sequences are invariant.
ACCTGAG-AG
ACGTG-GCAG
One mismatch and two indels; 7 of 10 columns match: 70% identical.
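The 70% figure can be reproduced with a short function. This is a minimal sketch that assumes identity is counted over all alignment columns, including gap columns. Other conventions divide by the shorter sequence length or by the number of ungapped columns. The name `percent_identity` is hypothetical.

```python
def percent_identity(a, b):
    """Percent of alignment columns in which both aligned sequences carry the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as a match only if the characters agree and are not gaps.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))
```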

18 Dotplot: A dotplot gives an overview of all possible alignments
[Figure: dotplot of the sequences TACATTACGTAC and ATACACTTA, axes labeled Sequence 1 and Sequence 2]

19 Dotplot: In a dotplot each diagonal corresponds to a possible (ungapped) alignment
[Figure: dotplot of the sequences TACATTACGTAC and ATACACTTA, axes labeled Sequence 1 and Sequence 2, with one diagonal highlighted]
One possible alignment: TACATTACGTAC ATACACTTA
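A dotplot is easy to compute. The sketch below builds the dot matrix for the two sequences from the dotplot slides and counts the matches on each diagonal, since each diagonal is one ungapped alignment. The function names and the text rendering are assumptions made for illustration; real dotplot tools usually add a window and threshold to suppress noise.

```python
def dotplot(seq1, seq2):
    """Return a matrix with True where seq2[i] == seq1[j] (rows: seq2, columns: seq1)."""
    return [[a == b for b in seq1] for a in seq2]


def diagonal_matches(seq1, seq2):
    """Count matches on each diagonal; offset k aligns seq2[i] with seq1[i + k]."""
    counts = {}
    for i, a in enumerate(seq2):
        for j, b in enumerate(seq1):
            if a == b:
                counts[j - i] = counts.get(j - i, 0) + 1
    return counts


seq1 = "TACATTACGTAC"
seq2 = "ATACACTTA"

# Print the dotplot as text: '*' marks identical residues.
print("  " + seq1)
for row_char, row in zip(seq2, dotplot(seq1, seq2)):
    print(row_char, "".join("*" if hit else "." for hit in row))

# The diagonal with the most dots is the best ungapped alignment.
best = max(diagonal_matches(seq1, seq2).items(), key=lambda kv: kv[1])
print("best diagonal (offset, matches):", best)
```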

20 Insertions / Deletions in a Dotplot
[Figure: dotplot of Sequence 1 against Sequence 2 showing an insertion/deletion]
TACTG-TCAT
||||| ||||
TACTGTTCAT

21 Alignment methods
Rigorous algorithms = Dynamic Programming: Needleman-Wunsch (global), Smith-Waterman (local)
Heuristic algorithms (faster but approximate): BLAST, FASTA

22 Pairwise alignment Pairwise sequence alignment methods are concerned with finding the best-matching piecewise local or global alignments of protein (amino acid) or DNA (nucleic acid) sequences. Typically, the purpose of this is to find homologues (relatives) of a gene or gene-product in a database of known examples. This information is useful for answering a variety of biological questions: 1. The identification of sequences of unknown structure or function. 2. The study of molecular evolution.

23 Dynamic Programming Approach to Sequence Alignment
The dynamic programming approach to sequence alignment always builds on the best result found so far. It aligns two sequences by inserting gaps at different locations so as to maximize the score of the alignment. The score is determined by a "match award", a "mismatch penalty" and a "gap penalty". The higher the score, the better the alignment. If both penalties are set to 0, it finds an alignment with the maximum number of matches. Maximum match = the largest number of matches one sequence can have with the other when all possible deletions are allowed. It is used to measure the similarity between two DNA or protein sequences, and from that to predict similarity of function. Examples: Needleman-Wunsch (1970), Sellers (1974), Smith-Waterman (1981)
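How a given alignment is scored can be shown with a short sketch. The default values (match = 1, mismatch = -1, gap = -2) are the ones used in the worked examples later in this presentation; the function name `score_alignment` and the use of "-" as the gap character are assumptions.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two already-aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap        # gap penalty for an indel column
        elif x == y:
            total += match      # match award
        else:
            total += mismatch   # mismatch penalty
    return total


print(score_alignment("AAAC", "A-GC"))
print(score_alignment("TCGCA", "TC-CA"))
```

Dynamic programming does not score alignments one by one like this; it finds the alignment that maximizes this sum without enumerating all candidates.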

24 Global alignment A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment. Global alignments are useful mostly for finding closely-related sequences.

25 Global Alignment Global algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may only share limited regions of conserved sequence. Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared. Global alignment is useful when you want to force two sequences to align over their entire length

26 Global Alignment Find the global best fit between two sequences
Example: the sequences s = VIVALASVEGAS and t = VIVADAVIS align like:
A(s,t) =
VIVALASVEGAS
|||| | |   |
VIVADA-V--IS
The dashes mark indels.

27 The Needleman-Wunsch algorithm
The Needleman-Wunsch algorithm performs a global alignment on two sequences (s and t) and is applied to align protein or nucleotide sequences. It is guaranteed to find the alignment with the maximum score. The Needleman-Wunsch algorithm is an example of dynamic programming, a discipline invented by Richard Bellman (an American mathematician) in 1953.

28 Local alignment Local alignment methods find related regions within sequences - they can consist of a subset of the characters within each sequence. For example, positions of sequence A might be aligned with positions 50-70 of sequence B. This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins (which is known as domain shuffling) can be identified as being related. This is not possible with global alignment methods.

29 The Smith Waterman algorithm
The Smith-Waterman algorithm (1981) determines similar regions between two nucleotide or protein sequences. Smith-Waterman is also a dynamic programming algorithm; it is a variation of Needleman-Wunsch adapted to local alignment. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme). However, the Smith-Waterman algorithm is demanding of time and memory resources: in order to align two sequences of lengths m and n, O(mn) time and space are required.

30 Global vs. Local Alignments
Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached. Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.

31 Statistical analysis of alignments
This works identically to gene finding:
* Generate randomized sequences based on the second string
* Determine the optimal alignments of the first sequence with these randomized sequences
* Compute a histogram and rank the observed score in this histogram
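The three steps can be sketched as a simple shuffle test. This is an illustration under stated assumptions, not a definitive procedure: it uses a local (Smith-Waterman style) score with match = 1, mismatch = -1, gap = -2, randomizes by shuffling the letters of the second sequence, and reports the fraction of randomized scores that reach the observed score. The names `local_score` and `shuffle_test` are hypothetical.

```python
import random
from collections import Counter


def local_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best local alignment score, keeping only the previous row in memory."""
    best = 0
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def shuffle_test(s, t, trials=200, seed=0):
    """Score s against shuffled copies of t and rank the observed score."""
    rng = random.Random(seed)
    observed = local_score(s, t)
    letters = list(t)
    random_scores = []
    for _ in range(trials):
        rng.shuffle(letters)  # same composition as t, random order
        random_scores.append(local_score(s, "".join(letters)))
    # Fraction of randomized scores at least as high as the observed one.
    at_least = sum(1 for x in random_scores if x >= observed)
    return observed, random_scores, at_least / trials


observed, scores, fraction = shuffle_test("GATCACCT", "GATACCC")
print("observed score:", observed)
print("histogram of randomized scores:", sorted(Counter(scores).items()))
print("fraction >= observed:", fraction)
```

A small fraction suggests the observed score is unlikely to arise from sequence composition alone. With sequences this short the test has little power, so the example only demonstrates the mechanics.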

32 The Needleman-Wunsch algorithm
A smart way to reduce the massive number of possibilities that need to be considered, while still guaranteeing that the best solution will be found (Saul Needleman and Christian Wunsch, 1970). The basic idea is to build up the best alignment by using optimal alignments of smaller subsequences.

33 Needleman & Wunsch
Place each sequence along one axis
Place score 0 at the upper-left corner
Fill in 1st row & column with gap penalty multiples
Fill in the matrix with max value of 3 possible moves:
  Vertical move: Score + gap penalty
  Horizontal move: Score + gap penalty
  Diagonal move: Score + match/mismatch score
The optimal alignment score is in the lower-right corner
To reconstruct the optimal alignment, trace back where the max at each step came from; stop when you hit the origin.
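The steps above can be sketched in Python as follows. This is a minimal illustration, not the original authors' implementation. It assumes a linear gap penalty given as a negative number (so it is added, as on the slide) and simple match/mismatch scores. When several optimal alignments exist it returns only one of them, preferring the diagonal move during trace-back.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment of s (columns) and t (rows); returns (score, aligned_s, aligned_t)."""
    rows, cols = len(t) + 1, len(s) + 1
    F = [[0] * cols for _ in range(rows)]
    # First row and column: multiples of the gap penalty.
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        F[i][0] = i * gap
    # Each cell is the max of the diagonal, vertical and horizontal moves.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if t[i - 1] == s[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Trace back from the lower-right corner to the origin.
    a, b = [], []
    i, j = len(t), len(s)
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and t[i - 1] == s[j - 1]) else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[j - 1]); b.append(t[i - 1]); i -= 1; j -= 1
        elif j > 0 and F[i][j] == F[i][j - 1] + gap:
            a.append(s[j - 1]); b.append("-"); j -= 1
        else:
            a.append("-"); b.append(t[i - 1]); i -= 1
    return F[-1][-1], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = needleman_wunsch("AAAC", "AGC")
print(score)
print(top)
print(bottom)
```

The full matrix costs O(mn) time and memory, which is fine for short sequences but becomes the limiting factor for long ones.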

34 Example
Let gap = -2, match = 1, mismatch = -1. Align AAAC (columns) with AGC (rows).

         empty    A    A    A    C
empty      0     -2   -4   -6   -8
A         -2      1   -1   -3   -5
G         -4     -1    0   -2   -4
C         -6     -3   -2   -1   -1

Optimal alignments (score -1):
AAAC    AAAC
A-GC    -AGC

35 Local Alignment
Problem first formulated: Smith and Waterman (1981)
Problem: Find an optimal alignment between a substring of s and a substring of t
Algorithm: a variant of the basic algorithm for global alignment

36 Smith & Waterman
Place each sequence along one axis
Place score 0 at the upper-left corner
Fill in 1st row & column with 0s
Fill in the matrix with max value of 4 possible values:
  Vertical move: Score + gap penalty
  Horizontal move: Score + gap penalty
  Diagonal move: Score + match/mismatch score
  Zero
The optimal alignment score is the max in the matrix
To reconstruct the optimal alignment, trace back where the max at each step came from; stop when a zero is hit.
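A minimal Python sketch of these steps follows. It makes the same assumptions as the slides' worked example (linear gap penalty given as a negative number, simple match/mismatch scores) and returns only the first highest-scoring local alignment it finds. It is an illustration, not the original authors' code.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Local alignment of s (columns) and t (rows); returns (score, aligned_s, aligned_t)."""
    rows, cols = len(t) + 1, len(s) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if t[i - 1] == s[j - 1] else mismatch)
            # Fourth option, zero: start a fresh alignment here.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the maximum until a zero is reached.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if t[i - 1] == s[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a.append(s[j - 1]); b.append(t[i - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i][j - 1] + gap:
            a.append(s[j - 1]); b.append("-"); j -= 1
        else:
            a.append("-"); b.append(t[i - 1]); i -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = smith_waterman("GATCACCT", "GATACCC")
print(score)
print(top)
print(bottom)
```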

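37 Local Alignment
Let gap = -2, match = 1, mismatch = -1. Align GATCACCT (columns) with GATACCC (rows).

         empty   G   A   T   C   A   C   C   T
empty      0     0   0   0   0   0   0   0   0
G          0     1   0   0   0   0   0   0   0
A          0     0   2   0   0   1   0   0   0
T          0     0   0   3   1   0   0   0   1
A          0     0   1   1   2   2   0   0   0
C          0     0   0   0   2   1   3   1   0
C          0     0   0   0   1   1   2   4   2
C          0     0   0   0   1   0   2   3   3

Best local alignment (score 4):
GATCACC
GAT-ACC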

38 Pairwise alignment: the solution
“Dynamic programming” (the Needleman-Wunsch algorithm)

39 Alignment depicted as path in matrix
[Figure: two paths through the alignment matrix, one for each alignment below]
TCGCA    TCGCA
TC-CA    T-CCA

40 Alignment depicted as path in matrix
Meaning of point in matrix: all residues up to this point have been aligned (but there are many different possible paths).
[Figure: alignment matrix with one position labeled “x”]
Position labeled “x”: TC aligned with TC
--TC    -TC    TC
TC--    T-C    TC

41 Creation of an alignment path matrix
If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j).
Three possibilities:
  x_i and y_j are aligned: F(i,j) = F(i-1,j-1) + s(x_i, y_j)
  x_i is aligned to a gap: F(i,j) = F(i-1,j) - d
  y_j is aligned to a gap: F(i,j) = F(i,j-1) - d
The best score up to (i,j) will be the largest of the three options.
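The recurrence translates almost line by line into code. The sketch below keeps the slide's notation: F for the matrix, s for the substitution score (passed in as a function) and d for a positive gap penalty that is subtracted. The boundary values F(i,0) = -i·d and F(0,j) = -j·d and the scoring function `simple_score` (1 for a match, -1 for a mismatch) are assumptions added to make the example runnable.

```python
def fill_matrix(x, y, s, d):
    """F[i][j] = best score for aligning x[:i] with y[:j], with linear gap penalty d."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        F[i][0] = -i * d
    for j in range(1, len(y) + 1):
        F[0][j] = -j * d
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # x_i aligned with y_j
                F[i - 1][j] - d,                          # x_i aligned to a gap
                F[i][j - 1] - d,                          # y_j aligned to a gap
            )
    return F


def simple_score(a, b):
    """Assumed substitution score: 1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1


F = fill_matrix("TCGCA", "TCCA", simple_score, d=2)
for row in F:
    print(row)
print("optimal global score:", F[-1][-1])
```

Passing s as a function means the same code works with a substitution matrix lookup in place of the simple match/mismatch rule.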

42 Dynamic programming: computation of scores
Any given point in matrix can only be reached from three possible previous positions (you cannot “align backwards”). => Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
[Figure: alignment matrix with point x marked]

46 Dynamic programming: computation of scores
Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”). => Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
score(x,y) = max of:
  score(x-1,y-1) + substitution-score(x,y)
  score(x-1,y) - gap-penalty
  score(x,y-1) - gap-penalty
Each new score is found by choosing the maximum of three possibilities. For each square in matrix: keep track of where best score came from. Fill in scores one row at a time, starting in upper left corner of matrix, ending in lower right corner.

47 Dynamic programming: example
[Figure: substitution score matrix for A, C, G, T]
Gaps: -2

52 Dynamic programming: example
TCGCA
:: ::
TC-CA
Score = 2

53 Global versus local alignments
Global alignment: align full length of both sequences (the “Needleman-Wunsch” algorithm).
Local alignment: find best partial alignment of two sequences (the “Smith-Waterman” algorithm).
[Figure: Seq 1 and Seq 2 under global alignment and under local alignment]

54 Local alignment overview
The recursive formula is changed by adding a fourth possibility: zero. This means local alignment scores are never negative.
score(x,y) = max of:
  score(x-1,y-1) + substitution-score(x,y)
  score(x-1,y) - gap-penalty
  score(x,y-1) - gap-penalty
  0
Trace-back is started at the highest value rather than in the lower right corner.
Trace-back is stopped as soon as a zero is encountered.

55 Local alignment: example

56 Alignments: things to keep in mind
“Optimal alignment” means “having the highest possible score, given substitution matrix and set of gap penalties”. This is NOT necessarily the biologically most meaningful alignment. Specifically, the underlying assumptions are often wrong: substitutions are not equally frequent at all positions, affine gap penalties do not model insertion/deletion well, etc. Pairwise alignment programs always produce an alignment - even when it does not make sense to align sequences.

