Published by Cortez Wormwood. Modified about 1 year ago.

1
Example
- Mitochondrial cytochrome b, a protein that transports electrons
- From the NCBI protein web page, search for cytb and:
  - Loxodonta africana (African elephant)
  - Elephas maximus (Indian elephant)
  - Mammuthus primigenius (Siberian woolly mammoth)
- Which modern elephant is closer to the mammoth?
- Use ClustalW to do the alignment
Chap. 3: Sequence Alignment

2
>0012AAX | cytochrome b [Elephas maximus]
MTHTRKSHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTMTAFSSMSHIC
RDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSF
WGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLG
LTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAI
LRSVPNKLGGVLALLLSILILGLMPLLHTSKHRSMMLRPLSQVLFWALTMDLLMLTWIGSQPVEYPYIAI
GQMASILYFSIILAFLPIAGMIENYLIK
>gi| |gb|AAW | cytochrome b [Loxodonta africana]
MTHIRKSYPLLKIINKSFIDLPTPSNISAWWNFGSLLGACLITQILTGLFLAMHYTPDTMTAFSSMSHIC
RDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSF
WGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFALHFILPFTMTALAGVHLTFLHETGSNNPLG
LTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAI
LRSVPNKLGGVLALFLSILILGLMPLLHTSKYRSMMLRPLSQVLFWTLTMDLLMLTWIGSQPVEYPYTII
GQMASILYFSIILAFLPIAGMIENYLIK
>gi| |dbj|BAA | cytochrome b [Mammuthus primigenius]
MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTMTAFSSMSHIC
RDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSF
WGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFALHFILPFTMIALAGVHLTFLHETGSNNPLG
LTSDSDKIPFHPYYTIKDFLGLLILILFLLLLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAI
LRSVPNKLGGVLALLLSILILGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIII
GQMASILYFSIILAFLPIAGMIENYLIK

3
Pairwise sequence alignment is the most fundamental operation of bioinformatics
- It is used to decide whether two proteins (or genes) are related structurally or functionally
- It is used to identify domains or motifs that are shared among proteins
- It is the basis of BLAST searching (next)
- It is used in the analysis of genomes

4
Globin
- Globins carry oxygen and were among the first proteins to be sequenced
- Hemoglobins: in red blood cells
- Myoglobin: in muscle cells of mammals
- Leghemoglobin: in legumes (beans, etc.)

5
Globin

6
(a) Myoglobin (b) Tetrameric hemoglobin (c) Beta globin subunit (d) Myoglobin & beta globin

7
Similarity and Homology
- Similarity
  - An observation or measurement of resemblance, independent of the source of the resemblance
  - Can be observed now and involves no historical hypothesis
- Homology
  - Specifies that the sequences, and the organisms carrying them, descended from a common ancestor
  - Implies that similarities are shared ancestral characteristics
  - Usually cannot be asserted from historical evidence, so it is an inference from observations of similarity

8
Homology
- Similarity attributed to descent from a common ancestor
- Two types of homology
  - Orthologs: homologous sequences in different species that arose from a common ancestral gene during speciation; they may or may not be responsible for a similar function.
  - Paralogs: homologous sequences within a single species that arose by gene duplication.

9
Orthologs: members of a gene (protein) family in various organisms. This tree shows globin orthologs.

10
Paralogs: members of a gene (protein) family within a species. This tree shows human globin paralogs.

11
Orthologs and paralogs are often viewed in a single tree

12
Globin phylogeny by Dayhoff (1972)

13
Globin phylogeny by Dayhoff in evolutionary time (1972)

14
Direct Alignment
- Given two sequences, score each position: +1 if the letters at the same position match, -1 otherwise
- Extremely simple, but what if there is a gap?
- A gap arises when a base is inserted or deleted (indel)
  - Perhaps specific to biological data
  - Perhaps a more significant mutation than a substitution, so give it a more negative score as a penalty
- Example:
  RNDKPFSTARN
  RNQKPKWWTA
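The +1/-1 rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `direct_score` is hypothetical, and ignoring the unpaired tail of the longer sequence is an assumption, since the slide does not say how unequal lengths are handled.

```python
def direct_score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Score two sequences position by position, with no gaps allowed."""
    # zip() stops at the shorter sequence, so any unpaired tail is ignored
    # (an assumption; the slide does not cover unequal lengths).
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# The two peptides from the slide: 4 matches and 6 mismatches over 10 positions.
print(direct_score("RNDKPFSTARN", "RNQKPKWWTA"))
```

Because no gaps are allowed, a single insertion or deletion shifts every later position out of register, which is the weakness the slide points to.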

15
Visual Alignment: Dotplot
- One sequence on the x axis and the other on the y axis
- Place a dot at a crosspoint if the residue is identical in both sequences
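A text-mode dotplot is easy to sketch. The function name `dotplot` and the choice of `*` and `.` as marks are illustrative assumptions; real dotplot tools usually add a sliding window and a threshold to suppress noise, which this sketch omits.

```python
def dotplot(x: str, y: str) -> list[str]:
    """Return one text row per character of y; '*' marks x[j] == y[i]."""
    return ["".join("*" if cx == cy else "." for cx in x) for cy in y]


# Hypothetical usage with two short DNA strings.
seq_x, seq_y = "GCTGAACG", "CTATAATC"
print("  " + seq_x)
for ch, row in zip(seq_y, dotplot(seq_x, seq_y)):
    print(ch, row)
```

Runs of `*` along a diagonal correspond to stretches where the two sequences match without gaps.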

16
Special Dotplot: Periodic, Palindrome

17
Sequence Alignment
- Direct alignment:
  g c t g a a c g
  c t a t a a t c
- An alignment with gaps:
  g c t g - a a - c - g
  - - c t - a t a a t c
- Another alignment with gaps:
  g c t g - a a - c g
  - c t a t a a t c -
- What are the criteria for a good alignment?
  - Use a score to check for optimality
  - The optimal alignment may not be unique

18
Calculation of an alignment score

19
General approach to pairwise alignment
- Given two sequences
- Select an algorithm that generates a score
- Allow gaps (insertions, deletions)
- The score reflects the degree of similarity
- Alignments can be global or local
- Estimate the probability that the alignment occurred by chance

20
Pairwise alignment: protein sequences can be more informative than DNA
- Protein is more informative (20 vs. 4 characters); many amino acids share related biophysical properties
- Codons are degenerate: changes in the third position often do not alter the amino acid that is specified
- Protein sequences offer a longer "look-back" time
- DNA sequences can be translated into protein and then used in pairwise alignments
- DNA alignments are still often appropriate:
  - to confirm the identity of a cDNA
  - to study noncoding regions of DNA
  - to study DNA polymorphisms
  - example: Neanderthal vs. modern human DNA

21
Genetic Code

22
Scoring Matrix
- Dotplot
  - Incredibly useful in identifying biological significance and interesting regions
  - Does not provide a measure of statistical similarity
- A numerical method
  - Provides not just the position-by-position overlap
  - But also the nature and characteristics of the residues being aligned
- Scoring matrices
  - Empirical weighting schemes

23
Scoring Matrix
- Three biological factors in constructing a scoring matrix
- Conservation
  - Account for conservation between proteins, but provide a way to assess conservative substitutions
  - The score represents which residues can substitute for other residues without adversely affecting the function of the native protein (determined by charge, size, hydrophobicity, etc.)
- Frequency
  - Reflect how often residues occur among proteins
  - Rare residues are given more weight
- Evolution
  - By design, scoring matrices implicitly represent evolutionary patterns
- Review Fl2kBl5QBq7VIoy-eDgDqXhaZ14#v=onepage&q&f=false

24
Scoring Matrix
- Log-odds score
  - q_{ij}: probability that residues i and j are seen aligned
  - p_i: probability of observing amino acid i among all proteins
  - s_{ij} = log( q_{ij} / (p_i p_j) )
- The score represents the ratio of the observed to the random (chance) frequency of substituting i by j
  - Positive score: the two residues replace each other more often than expected by chance
  - Negative score: the substitution is less likely than expected by chance
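The log-odds formula above can be checked numerically. In this sketch the function name `log_odds` and the probabilities are made up for illustration, and the base-2 logarithm is an assumption: the slide writes `log` without a base, and published matrices additionally scale and round the result.

```python
import math


def log_odds(q_ij: float, p_i: float, p_j: float, base: float = 2.0) -> float:
    """s_ij = log(q_ij / (p_i * p_j)); base 2 is an assumption, not from the slide."""
    return math.log(q_ij / (p_i * p_j), base)


# Hypothetical background frequencies p_i = p_j = 0.1, so chance pairing = 0.01.
print(log_odds(0.04, 0.1, 0.1))   # pair seen 4x more often than chance: positive
print(log_odds(0.005, 0.1, 0.1))  # pair seen half as often as chance: negative
```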

25
Scoring Matrix
- Nucleotides: 4x4
- Amino acids: more complicated, 20x20

26
Other Scores
- Gap penalty: gap initiation and extension
- Clustal-W recommends
  - For DNA sequences: the identity matrix (1 for a match, 0 for a mismatch), with a gap penalty of 10 for initiation and 0.1 for extension per residue
  - For amino acid sequences: the BLOSUM62 matrix for substitution, with a gap penalty of 11 for initiation and 1 for extension per residue
- Examples:
  a a a g a a a
  a a a - a a a
  a a a g g g a a a
  a a a a a a
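A gap penalty with separate initiation and extension costs can be sketched as below. The function name `gap_penalty` is hypothetical, and the convention that a gap of k residues costs `gap_open + gap_extend * k` is an assumption: tools differ on whether the first gapped residue also pays the extension cost.

```python
def gap_penalty(length: int, gap_open: float, gap_extend: float) -> float:
    """Cost of one gap spanning `length` residues (assumed: open + extend * length)."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


# Parameters quoted on the slide: DNA (10, 0.1) and BLOSUM62 protein (11, 1).
print(gap_penalty(1, 10, 0.1), gap_penalty(3, 10, 0.1))
print(gap_penalty(1, 11, 1), gap_penalty(3, 11, 1))
```

Under this scheme one gap of three residues is cheaper than three separate one-residue gaps, which reflects the idea that a single indel event can insert or delete several residues at once.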

27
Pairwise Alignment: Global and Local
- Given a scoring scheme, find the alignments that maximize the score
- Global
  - Aligns the entire protein or DNA sequences
  - Needleman and Wunsch (dynamic programming)
- Local
  - Focuses on the regions of greatest similarity
  - Smith and Waterman
  - In general, preferable to global alignment, because often only portions of proteins align

28
Global and Local in Dotplot

29
Dynamic Programming
- Guaranteed to yield an optimal global alignment
- Drawback: many alignments may give the same optimal score, and none of them may correspond to the biologically correct alignment
  - W. Fitch and T. Smith found 17 alignments of the alpha- and beta-chains of chicken haemoglobin, one of which is correct based on structures
- Drawback: complexity O(nm) for sequences of lengths n and m

30
Dynamic Programming
- Rock removal game
  - Two piles of rocks, each with 10 rocks
  - Players A and B alternately remove one rock from a single pile or one rock each from both piles
  - The player who removes the last rock(s) wins the game
- Use a reduction strategy, starting with smaller problems
- Consider the 2+2 problem (A moves first)
  - If A removes one rock from each pile, B removes one rock from each pile
  - If A removes one rock from one pile, B takes one rock from the same pile
  - Either way, B wins
- 3+3 problem?
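The reduction strategy can be sketched as a table filled from small pile sizes upward. The function name `winning_table` is hypothetical; the rule encoded is the one on the slide (take one rock from one pile or one from each, and whoever takes the last rock wins).

```python
def winning_table(n: int, m: int) -> list[list[bool]]:
    """win[i][j] is True if the player about to move wins with piles of i and j rocks."""
    win = [[False] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            # Positions reachable in one move: one from X, one from Y, or one from each.
            moves = []
            if i > 0:
                moves.append((i - 1, j))
            if j > 0:
                moves.append((i, j - 1))
            if i > 0 and j > 0:
                moves.append((i - 1, j - 1))
            # The mover wins if some move leaves the opponent in a losing position.
            # With no rocks left (0, 0) there is no move, so the mover has already lost.
            win[i][j] = any(not win[a][b] for a, b in moves)
    return win


table = winning_table(10, 10)
print("2+2, first player wins:", table[2][2])
print("3+3, first player wins:", table[3][3])
print("10+10, first player wins:", table[10][10])
```

Each entry depends only on three smaller entries that are already filled in, which is the same dependency pattern as the alignment recurrences later in the deck.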

31
Rock Removal with
- ↑ A takes one from pile X
- ← A takes one from pile Y
- A takes one from each pile
- * A will lose

32
Manhattan Tourist Problem
- Visit as many tourist sites as possible in a Manhattan grid
- Move to the east or south only
- Start at the upper left corner
- End at # 15, the lower right corner

33
Problem Statement
- Given a weighted grid G with two vertices (nodes), a source and a sink
- Find the longest path in the weighted grid
- Weight: number of attraction sites on an edge (link)
- Each vertex (node) can be identified by (i,j)
  - Source at (0,0)
  - Sink at (n,m)

34
Solution
- Define s_{i,j}: the length of the longest path from the source to vertex (i,j) (0 ≤ i ≤ n, 0 ≤ j ≤ m)
- Solve the smaller problems first
- Solving for s_{0,j} and s_{i,0} is easy
- Grid label shown: (0,0)

35
Solution (2)
- Iteratively solve for neighboring nodes: s_{i,1}, s_{i,2}, etc.
- s_{i,j} = max[ s_{i-1,j} + weight of the edge between (i-1,j) and (i,j),
                 s_{i,j-1} + weight of the edge between (i,j-1) and (i,j) ]
- Grid labels shown: (0,0) (1,0) (2,0) (3,0) (0,1)

36
Algorithm
Given W_south(i,j) and W_east(i,j):
  s_{0,0} = 0
  for i = 1 to n
    s_{i,0} = s_{i-1,0} + W_south(i,0)
  for j = 1 to m
    s_{0,j} = s_{0,j-1} + W_east(0,j)
  for i = 1 to n
    for j = 1 to m
      s_{i,j} = max[ s_{i-1,j} + W_south(i,j), s_{i,j-1} + W_east(i,j) ]
  return s_{n,m}
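The pseudocode above translates almost line for line into Python. In this sketch the function name and the example weights are made up; the arrays are indexed by destination vertex as on the slide, so row 0 of `w_south` and column 0 of `w_east` are unused padding (an implementation choice, not something the slide specifies).

```python
def manhattan_tourist(w_south, w_east):
    """Longest path from (0,0) to (n,m), moving south or east only.

    w_south[i][j]: weight of the edge (i-1,j) -> (i,j); row 0 is unused.
    w_east[i][j]:  weight of the edge (i,j-1) -> (i,j); column 0 is unused.
    """
    n = len(w_south) - 1
    m = len(w_south[0]) - 1
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: only southward moves
        s[i][0] = s[i - 1][0] + w_south[i][0]
    for j in range(1, m + 1):          # first row: only eastward moves
        s[0][j] = s[0][j - 1] + w_east[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + w_south[i][j],
                          s[i][j - 1] + w_east[i][j])
    return s[n][m]


# Hypothetical 2x2 grid of blocks (3x3 vertices).
w_south = [[0, 0, 0], [1, 0, 2], [4, 3, 1]]
w_east = [[0, 3, 2], [0, 1, 4], [0, 2, 1]]
print(manhattan_tourist(w_south, w_east))
```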

37
General Graph Problem
- A general graph is not a regular grid: a node need not have exactly two incoming edges (indegree) and two outgoing edges (outdegree)

38
Directed Acyclic Graph
- DAG: Directed Acyclic Graph, G = (V, E)
- Longest Path Problem
  - s_v = max( s_u + weight of the edge from u to v ) over all u in Predecessor(v)
  - The predecessor relationship has to be established ahead of time
- Figure labels: v, u1, u2, u3
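One way to establish the predecessor relationship "ahead of time" is to visit vertices in topological order. The sketch below does that with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The function name, the edge dictionary format, and the example graph are illustrative assumptions.

```python
from graphlib import TopologicalSorter


def longest_path(edges, source, sink):
    """edges maps (u, v) -> weight; returns the largest total weight from source to sink."""
    preds = {}
    for (u, v), w in edges.items():
        preds.setdefault(u, [])
        preds.setdefault(v, []).append((u, w))
    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    order = TopologicalSorter(
        {v: [u for u, _ in ps] for v, ps in preds.items()}
    ).static_order()
    s = {v: float("-inf") for v in preds}   # -inf marks "not reachable from source"
    s[source] = 0
    for v in order:
        for u, w in preds[v]:
            s[v] = max(s[v], s[u] + w)      # s_v = max over predecessors u
    return s[sink]


# Hypothetical DAG with four vertices.
edges = {("a", "b"): 2, ("a", "c"): 5, ("b", "c"): 4, ("b", "d"): 4, ("c", "d"): 0}
print(longest_path(edges, "a", "d"))
```

The grid of the Manhattan Tourist Problem is the special case in which every interior vertex has exactly two predecessors, so a row-by-row sweep is already a valid topological order.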

39
Graph Problem applied to Alignment
- Measure of similarity
  - Hamming distance: for equal-length sequences
  - Levenshtein or edit distance (1966): for unequal-length sequences
    - Minimum number of 'edit operations' (insertion, deletion, or alteration of a single character in either sequence) required to change one string into the other
- e.g. Levenshtein distance = 3
  a g - t c c
  c g c t c a
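The edit distance in the example can be computed with the standard dynamic-programming table. This is a minimal sketch with unit cost for every insertion, deletion, and substitution; the function name `levenshtein` is illustrative.

```python
def levenshtein(v: str, w: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    n, m = len(v), len(w)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # delete all of v[:i]
    for j in range(m + 1):
        d[0][j] = j                      # insert all of w[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                  # deletion
                          d[i][j - 1] + 1,                  # insertion
                          d[i - 1][j - 1] + substitution)   # match or substitution
    return d[n][m]


# The two strings from the slide, with the gap removed.
print(levenshtein("agtcc", "cgctca"))
```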

40
Edit Distance and Alignment
- Two strings, v and w
- Gaps are allowed in either string, except that no column may have a gap in both strings
- Each column is labeled with the number of characters of the original (ungapped) string used so far
  A T - G T T A T -
  A T C G T - A - G
  v: ( 0 1 2 2 3 4 5 6 7 7 )
  w: ( 0 1 2 3 4 5 5 6 6 7 )
- For both strings together: ( 0 0 ) ( 1 1 ) ( 2 2 ) ( 2 3 ) ( 3 4 ) ( 4 5 ) ( 5 5 ) ( 6 6 ) ( 7 6 ) ( 7 7 )
- This represents a path in a grid
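The correspondence between a gapped alignment and a grid path can be made concrete with a short sketch. The function name `alignment_path` and the use of `-` as the gap character are assumptions for illustration.

```python
def alignment_path(v_row: str, w_row: str, gap: str = "-"):
    """Convert two gapped alignment rows into the list of grid vertices (i, j) visited."""
    if len(v_row) != len(w_row):
        raise ValueError("alignment rows must have the same number of columns")
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(v_row, w_row):
        if a == gap and b == gap:
            raise ValueError("a column may not contain a gap in both strings")
        if a != gap:
            i += 1          # consumed one character of v: move south
        if b != gap:
            j += 1          # consumed one character of w: move east
        path.append((i, j))
    return path


# The alignment from the slide.
print(alignment_path("AT-GTTAT-", "ATCGT-A-G"))
```

A column with two characters is a diagonal step, a gap in w is a vertical step, and a gap in v is a horizontal step, so scoring an alignment is the same as summing edge weights along its path.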

41
Edit Distance
- Vertex (i,j) corresponds to the column ( i j ) for (v_i, w_j)
- G = (V, E)
- Longest Path Problem
  - s_v = max( s_u + weight of the edge from u to v ) over all u in Predecessor(v)
  - The predecessor relationship has to be established ahead of time

42
Global Alignment
- A string is a sequence of characters drawn from an alphabet A of size k
- Scoring matrix δ of size (k+1)x(k+1); the extra row and column score the gap character
- Problem statement: given two strings v and w and a scoring matrix δ, find the longest (maximum-score) path
- Dynamic programming kernel (recurrence relationship):
  s_{i,j} = max [ s_{i-1,j} + δ(v_i, -),
                  s_{i,j-1} + δ(-, w_j),
                  s_{i-1,j-1} + δ(v_i, w_j) ]

43
Global Alignment
- Example of a scoring matrix: match: +1; mismatch: -μ; indel: -σ
- Indels are frequent, and a gap penalty proportional to the indel size is considered too severe
- Affine gap penalties soften the penalty rate
  - Can be linear in the indel length x: -(a + bx)
- Recurrence:
  s_{i,j} = max [ s_{i-1,j} - σ,
                  s_{i,j-1} - σ,
                  s_{i-1,j-1} + 1, if v_i = w_j,
                  s_{i-1,j-1} - μ, otherwise ]
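The recurrence with match +1, mismatch -μ, and indel -σ can be sketched as follows, including a traceback that recovers one optimal alignment. This is an illustrative sketch, not the presenter's implementation: the function name, the default μ = σ = 1, and the tie-breaking order in the traceback (diagonal, then up, then left) are assumptions, and the simple per-residue gap cost is used rather than an affine one.

```python
def needleman_wunsch(v: str, w: str, mu: int = 1, sigma: int = 1):
    """Global alignment score and one optimal alignment (match +1, mismatch -mu, indel -sigma)."""
    n, m = len(v), len(w)

    def sub(i, j):
        return 1 if v[i - 1] == w[j - 1] else -mu

    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -sigma * i
    for j in range(1, m + 1):
        s[0][j] = -sigma * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma,
                          s[i - 1][j - 1] + sub(i, j))

    # Traceback from (n, m) to (0, 0); ties are broken diagonal, up, left.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + sub(i, j):
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] - sigma:
            top.append(v[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1]); j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, row_v, row_w = needleman_wunsch("ACGT", "AGT")
print(score)
print(row_v)
print(row_w)
```

Because several paths can reach the same score, a different tie-breaking order may return a different but equally optimal alignment.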

44
Needleman-Wunsch, 1970 Setting up a matrix

45

46
Scoring the matrix

47

48
Identifying the optimal alignment

49
Local Alignment
- Global sequence alignment is useful for aligning sequences from the same protein family, for example
- In biological applications, only substrings of the two sequences may be highly conserved
- Temple Smith and Michael Waterman, 1981
- In a global alignment, biologically irrelevant diagonal matches are likely to have a higher score than the alignment of the conserved substrings

50
Local Alignment Problem
- Given two strings v and w and a scoring matrix δ, find substrings of v and w whose global alignment score is maximal among all substrings of v and w
- Seemingly harder: global alignment finds the longest path from (0,0) to (n,m), whereas local alignment finds the longest path among all paths between two arbitrary points, (i,j) and (i', j')
- Add edges of weight 0 from (0,0) to every other vertex (vertex (0,0) becomes a predecessor of every vertex)

51
Local Alignment Solution
- The recurrence kernel becomes
  s_{i,j} = max [ 0,
                  s_{i-1,j} + δ(v_i, -),
                  s_{i,j-1} + δ(-, w_j),
                  s_{i-1,j-1} + δ(v_i, w_j) ]
- Select the largest s_{i,j} over the whole matrix
- Other non-maximal local alignments may have biological significance
  - Select the k best nonoverlapping local alignments
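The local recurrence differs from the global one only by the extra 0 term and by starting the traceback at the largest cell. The sketch below uses a simple scoring scheme in place of a general δ (match +1, mismatch -μ, indel -σ, with μ = σ = 1 as assumed defaults); the function name and the test strings are made up.

```python
def smith_waterman(v: str, w: str, mu: int = 1, sigma: int = 1):
    """Best local alignment score and the two aligned substrings (with gaps)."""
    n, m = len(v), len(w)

    def sub(i, j):
        return 1 if v[i - 1] == w[j - 1] else -mu

    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(0,                         # free ride from (0,0)
                          s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma,
                          s[i - 1][j - 1] + sub(i, j))
            if s[i][j] > best:
                best, best_pos = s[i][j], (i, j)

    # Trace back from the largest cell until the score drops to 0.
    top, bottom = [], []
    i, j = best_pos
    while s[i][j] > 0:
        if s[i][j] == s[i - 1][j - 1] + sub(i, j):
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
        elif s[i][j] == s[i - 1][j] - sigma:
            top.append(v[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Hypothetical strings sharing only the short conserved core "ACG".
print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

Finding the k best nonoverlapping local alignments, as the slide suggests, would need an extra step that masks the cells of each reported alignment before searching again; that step is not shown here.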

52


© 2017 SlidePlayer.com Inc.

All rights reserved.
