Presentation on theme: "Chap. 3: Sequence Alignment"— Presentation transcript:
1 Chap. 3: Sequence Alignment
Example
- Mitochondrial cytochrome b: transports electrons
- From the NCBI protein web page, search for cytb and
  - Loxodonta africana (African elephant)
  - Elephas maximus (Indian elephant)
  - Mammuthus primigenius (Siberian woolly mammoth)
- Which modern elephant is closer to a mammoth?
- Use ClustalW to do the alignment
2
>0012AAX12542.1| cytochrome b [Elephas maximus]
MTHTRKSHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTMTAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILILGLMPLLHTSKHRSMMLRPLSQVLFWALTMDLLMLTWIGSQPVEYPYIAIGQMASILYFSIILAFLPIAGMIENYLIK
>gi| |gb|AAW | cytochrome b [Loxodonta africana]
MTHIRKSYPLLKIINKSFIDLPTPSNISAWWNFGSLLGACLITQILTGLFLAMHYTPDTMTAFSSMSHICWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFALHFILPFTMTALAGVHLTFLHETGSNNPLGLRSVPNKLGGVLALFLSILILGLMPLLHTSKYRSMMLRPLSQVLFWTLTMDLLMLTWIGSQPVEYPYTII
>gi| |dbj|BAA | cytochrome b [Mammuthus primigenius]
MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTMTAFSSMSHICWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFALHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILFLLLLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILILGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIII
3 Pairwise sequence alignment is the most fundamental operation of bioinformatics
- It is used to decide if two proteins (or genes) are related structurally or functionally
- It is used to identify domains or motifs that are shared among proteins
- It is the basis of BLAST searching (next)
- It is used in the analysis of genomes
4 Globin
- Globins carry oxygen and were among the first proteins to be sequenced
- Hemoglobins: in red blood cells
- Myoglobin: in muscle cells of mammals
- Leghemoglobin: in legumes (beans, etc.)
7 Similarity and Homology
Similarity
- Observation or measurement of resemblance, independent of the source of the resemblance
- Can be observed now and involves no historical hypothesis
Homology
- Specifies that the sequences, and the organisms they come from, descended from a common ancestor
- Implies that similarities are shared ancestral characteristics
- Cannot be asserted from direct historical evidence, and thus is an inference from observations of similarity
8 Homology
- Similarity attributed to descent from a common ancestor
- Two types of homology
  - Orthologs: homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function
  - Paralogs: homologous sequences within a single species that arose by gene duplication
9 Orthologs: members of a gene (protein) family in various organisms. This tree shows globin orthologs.
10 Paralogs: members of a gene (protein) family within a species. This tree shows human globin paralogs.
11 Orthologs and paralogs are often viewed in a single tree
13 Globin phylogeny by Dayhoff in evolutionary time (1972)
14 Direct Alignment
- Given two sequences, score +1 if the letters in the same position match, -1 otherwise
- Extremely simple, but what if there is a gap?
- A gap arises when a base is inserted or deleted (indel)
- Maybe only in biological data
- A gap may be a more significant mutation: give it a more negative score as a penalty
Example sequences:
RNDKPFSTARN
RNQKPKWWTA
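The scoring rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function name `score_alignment`, the use of `-` as the gap symbol, and the gap score of -2 are assumptions chosen for the example; only the +1/-1 match/mismatch values come from the slide.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows column by column; '-' marks a gap.

    The gap score of -2 is an illustrative assumption (more negative
    than a mismatch, as the slide suggests).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap          # a column containing a gap
        elif a == b:
            total += match        # identical letters
        else:
            total += mismatch     # different letters
    return total


# Direct (ungapped) comparison of two equal-length strings
print(score_alignment("gctgaacg", "ctataatc"))
# The same two strings with gaps inserted
print(score_alignment("gctg-aa-cg", "-ctataatc-"))
```

With a gap score of -2 both calls give the same total, while a milder gap score (for example `gap=-1`) favors the gapped version, which shows how strongly the choice of penalty drives which alignment looks best.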
15 Visual Alignment -- Dotplot
- One sequence on the x axis and the other on the y axis
- Place a dot at a crosspoint if the residues are identical in both sequences
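A dotplot is easy to sketch as text. The helper below is a hypothetical illustration (the name `dotplot` and the `*`/`.` characters are assumptions): each output row corresponds to one residue of the y-axis sequence, each column to one residue of the x-axis sequence.

```python
def dotplot(x_seq, y_seq):
    """Return one text row per residue of y_seq; '*' marks identical residues."""
    return [
        "".join("*" if x == y else "." for x in x_seq)
        for y in y_seq
    ]


# Shared regions show up as diagonal runs of '*'
for row in dotplot("GATTACA", "GATCA"):
    print(row)
```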
17 Sequence Alignment
- What are the criteria for a good alignment?
- Use a score to check for optimality
- May not produce a unique optimal alignment
Direct alignment:
g c t g a a c g
c t a t a a t c
An alignment with gaps:
g c t g - a a - c g
- c t a t a a t c -
Another alignment with gaps:
g c t g - a a - c - g
- - c t - a t a a t c
19 General approach to pairwise alignment
- Given two sequences
- Select an algorithm that generates a score
- Allow gaps (insertions, deletions)
- Score reflects degree of similarity
- Alignments can be global or local
- Estimate the probability that the alignment occurred by chance
20 Pairwise alignment: protein sequences can be more informative than DNA
- Protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties
- Codons are degenerate: changes in the third position often do not alter the amino acid that is specified
- Protein sequences offer a longer “look-back” time
- DNA sequences can be translated into protein, and then used in pairwise alignments
- DNA alignments are still appropriate in many cases:
  - to confirm the identity of a cDNA
  - to study noncoding regions of DNA
  - to study DNA polymorphisms
  - example: Neanderthal vs modern human DNA
22 Scoring Matrix
- Dotplot
  - Incredibly useful in identifying biological significance and interesting regions
  - Does not provide a measure of statistical similarity
- A numerical method
  - Provides not just the position-by-position overlap
  - But also the nature and characteristics of the residues being aligned
- Scoring matrices
  - Empirical weighting schemes
23 Scoring Matrix
Three biological factors in constructing a scoring matrix
- Conservation
  - Account for conservation between proteins, but provide a way to assess conservative substitutions
  - The score represents which residues can substitute for other residues without adversely affecting the function of the native protein (determined by charge, size, hydrophobicity, etc.)
- Frequency
  - Reflect how often residues occur among proteins
  - Rare residues are given more weight
- Evolution
  - By design, implicitly represent evolutionary patterns
24 Scoring Matrix: Log-Odds Score
- $q_{ij}$: probability of how often residues i and j are seen aligned
- $p_i$: probability of observing amino acid i among all proteins
- Score: $s_{ij} = \log \dfrac{q_{ij}}{p_i p_j}$
- The score represents the ratio of the observed to the random frequency of substituting i by j
- Positive score: the two residues replace each other more often than expected by chance
- Negative score: the substitution is less likely than by chance
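The log-odds formula can be checked numerically with a short sketch. The function name `log_odds` and the choice of base 2 are assumptions made for illustration (the slide does not state the base of the logarithm), and the probabilities in the example are made-up values, not real amino acid frequencies.

```python
import math


def log_odds(q_ij, p_i, p_j):
    """Log-odds score log2(q_ij / (p_i * p_j)); base 2 is an assumption."""
    return math.log2(q_ij / (p_i * p_j))


# Made-up probabilities: the pair is seen twice as often as chance predicts
print(log_odds(0.25, 0.25, 0.5))     # positive score
# Made-up probabilities: the pair is seen half as often as chance predicts
print(log_odds(0.0625, 0.25, 0.5))   # negative score
```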
25 Scoring Matrix
- Nucleotides
- AAs: more complicated, 20x20
26 Other Scores
Gap penalty
- Gap initiation and extension
- Clustal-W recommendations
  - For DNA sequences: identity matrix (1 for a match, 0 for a mismatch), gap penalty of 10 for initiation and 0.1 for extension per residue
  - For AA sequences: BLOSUM62 matrix for substitution, gap penalty of 11 for initiation and 1 for extension per residue
Examples:
a a a g a a a
a a a - a a a

a a a g g g a a a
a a a - - - a a a
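The difference between gap initiation and gap extension can be made concrete with a small sketch. The function name `gap_penalty` is hypothetical, and charging the extension cost for every residue of the gap (including the first) is an assumption; some tools charge extension only from the second residue on.

```python
def gap_penalty(length, initiation, extension):
    """Penalty for one contiguous gap of `length` residues.

    Assumes: penalty = initiation + extension * length, with the extension
    cost charged for every gapped residue (conventions differ between tools).
    """
    if length <= 0:
        return 0
    return initiation + extension * length


# Amino acid values from the slide: initiation 11, extension 1
one_long_gap = gap_penalty(3, 11, 1)          # one gap of three residues
three_short_gaps = 3 * gap_penalty(1, 11, 1)  # three separate single-residue gaps
print(one_long_gap, three_short_gaps)
```

Under this convention one long gap costs far less than several short gaps of the same total length, which matches the idea that a single indel event can insert or delete several residues at once.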
27 Pairwise Alignment: Global and Local
- Given a scoring scheme, find alignments maximizing the score
- Global
  - Entire sequence of protein or DNA
  - Needleman and Wunsch (dynamic programming)
- Local
  - Focus on regions of greatest similarity
  - Smith and Waterman
  - In general, preferable to global alignment, because only portions of proteins align
29 Dynamic Programming
- Guaranteed to yield an optimal global alignment
- Drawback: many alignments may give the same optimal score, and none of them may correspond to the biologically correct alignment
  - W. Fitch and T. Smith found 17 alignments of the alpha- and beta-chains of chicken haemoglobin, one of which is correct based on structures
- Drawback: complexity O(nm) for sequences of lengths n and m
30 Dynamic Programming
Rock removal game
- Two piles of rocks, each with 10 rocks
- A and B alternately remove one rock from a single pile or one rock each from both piles
- The player who removes the last rock(s) wins the game
- Use a reduction strategy, starting with smaller problems
- Consider the 2+2 problem
  - If A removes one rock from each pile, B removes one rock from each pile
  - If A removes one rock from one pile, B takes one rock from the same pile
  - Either way, B wins
- 3+3 problem?
31 Rock Removal with 10+10
Legend for the table of moves:
- ↑ A takes one from pile X
- ← A takes one from pile Y
- A takes one from each pile
- * A will lose
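The reduction strategy for the rock game can be sketched as a table filled from small problems to large ones. This is an illustrative sketch, not the slide's own table: the function name `first_player_wins` is hypothetical, and a position is marked as winning when the player to move (A) can reach a position that is losing for the opponent.

```python
def first_player_wins(n, m):
    """True if the player to move wins with piles of n and m rocks."""
    # win[i][j]: does the player to move win with i and j rocks left?
    win = [[False] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            reachable = []
            if i > 0:
                reachable.append(win[i - 1][j])      # one rock from pile X
            if j > 0:
                reachable.append(win[i][j - 1])      # one rock from pile Y
            if i > 0 and j > 0:
                reachable.append(win[i - 1][j - 1])  # one rock from each pile
            # Winning if some move leaves the opponent in a losing position.
            # With no rocks left there is no move, so (0, 0) is losing.
            win[i][j] = any(not w for w in reachable)
    return win[n][m]


for size in (2, 3, 10):
    print(size, first_player_wins(size, size))
```

Each cell depends only on three previously computed neighbors, which is the same table-filling pattern used for alignment in the slides that follow.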
32 Manhattan Tourist Problem
- Visit as many tourist sites as possible in a Manhattan grid
- Move to the east or south only
- Start at the upper left corner
- End at #15, the lower right corner
33 Problem Statement
- Given a weighted grid G with two vertices (nodes) for a source and a sink
- Find the longest path in the weighted grid
- Weight: number of attraction sites on an edge (link)
- Each vertex (node) can be identified by (i,j)
- Source at (0,0)
- Sink at (n,m)
[Figure: grid with edge weights]
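34 Solution
- Define $s_{i,j}$: the length of the longest path from the source (0,0) to vertex (i,j), for 0 ≤ i ≤ n, 0 ≤ j ≤ m
- Solve the smaller problems first
- Solving for $s_{0,j}$ and $s_{i,0}$ is easy
[Figure: the weighted grid with $s$ values filled in along the first row and first column]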
35 Solution (2)
- Iteratively solve for neighboring nodes: $s_{i,1}$, $s_{i,2}$, etc.
$$ s_{i,j} = \max \begin{cases} s_{i-1,j} + \text{weight of the edge between } (i-1,j) \text{ and } (i,j) \\ s_{i,j-1} + \text{weight of the edge between } (i,j-1) \text{ and } (i,j) \end{cases} $$
[Figure: the weighted grid with further $s$ values filled in]
36 Algorithm
Given Weast(i,j) and Wsouth(i,j):
  s_{0,0} = 0
  for i = 1 to n
    s_{i,0} = s_{i-1,0} + Wsouth(i,0)
  for j = 1 to m
    s_{0,j} = s_{0,j-1} + Weast(0,j)
  for i = 1 to n
    for j = 1 to m
      s_{i,j} = max[ s_{i-1,j} + Wsouth(i,j), s_{i,j-1} + Weast(i,j) ]
  return s_{n,m}
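The algorithm translates almost line for line into Python. In this sketch the data layout is an assumption: `south[i-1][j]` holds the weight of the edge entering vertex (i,j) from the north, and `east[i][j-1]` the weight of the edge entering (i,j) from the west. The small grid used in the example is made up, not the grid from the slides.

```python
def manhattan_tourist(south, east):
    """Length of the longest path from (0,0) to (n,m), moving south or east.

    south: n rows of m+1 weights; south[i-1][j] is the edge (i-1,j) -> (i,j)
    east:  n+1 rows of m weights; east[i][j-1] is the edge (i,j-1) -> (i,j)
    """
    n = len(south)        # number of southward steps
    m = len(east[0])      # number of eastward steps
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):           # first column: only southward moves
        s[i][0] = s[i - 1][0] + south[i - 1][0]
    for j in range(1, m + 1):           # first row: only eastward moves
        s[0][j] = s[0][j - 1] + east[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + south[i - 1][j],
                          s[i][j - 1] + east[i][j - 1])
    return s[n][m]


# A made-up 2x2 grid (3x3 vertices)
south = [[1, 0, 2],
         [4, 3, 1]]
east = [[3, 2],
        [2, 4],
        [0, 2]]
print(manhattan_tourist(south, east))
```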
37 General Graph Problem
- Not regular: a node need not have exactly two inputs (indegree) and two outputs (outdegree)
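38 Directed Acyclic Graph
- DAG: Directed Acyclic Graph, G = (V, E)
- Longest Path Problem
$$ s_v = \max_{u \in \text{Predecessor}(v)} \left( s_u + \text{weight of the edge from } u \text{ to } v \right) $$
- The predecessor relationship has to be established ahead of time
[Figure: vertex v with predecessors u1, u2, u3 and weighted incoming edges]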
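One way to establish the predecessor relationship ahead of time is to process vertices in topological order. The sketch below assumes an edge list of `(u, v, weight)` triples and uses Kahn's algorithm for the ordering; the function name `longest_path` and the example graph are illustrative, not taken from the slides.

```python
from collections import defaultdict, deque


def longest_path(edges, source, sink):
    """Longest path weight from source to sink in a DAG given as (u, v, w) edges."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        successors[u].append((v, w))
        indegree[v] += 1
        nodes.update((u, v))

    # Kahn's algorithm: every vertex is emitted after all of its predecessors
    queue = deque(node for node in nodes if indegree[node] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    best = {node: float("-inf") for node in nodes}
    best[source] = 0
    for u in order:
        if best[u] == float("-inf"):
            continue                     # not reachable from the source
        for v, w in successors[u]:
            best[v] = max(best[v], best[u] + w)   # s_v = max(s_u + weight)
    return best[sink]


edges = [("s", "a", 1), ("s", "b", 5), ("a", "b", 5), ("a", "t", 7), ("b", "t", 3)]
print(longest_path(edges, "s", "t"))
```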
39 Graph Problem applied to Alignment
Measure of similarity
- Hamming distance: equal-length sequences
- Levenshtein or edit distance (1966): unequal-length sequences
  - Minimum number of 'edit operations' (insertion, deletion, or alteration of a single character in either sequence) required to change one string into the other
- e.g. Levenshtein distance = 3
a g - t c c
c g c t c a
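The edit distance can be computed with the same dynamic programming idea. The following is a minimal sketch (the function name `levenshtein` is an assumption) that keeps only two rows of the table and charges 1 for each insertion, deletion, or alteration.

```python
def levenshtein(v, w):
    """Minimum number of single-character insertions, deletions, or alterations."""
    prev = list(range(len(w) + 1))       # distance from "" to each prefix of w
    for i, a in enumerate(v, start=1):
        curr = [i]                       # distance from v[:i] to ""
        for j, b in enumerate(w, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from v
                curr[j - 1] + 1,         # insert b into v
                prev[j - 1] + (a != b),  # keep (cost 0) or alter (cost 1)
            ))
        prev = curr
    return prev[-1]


# The two strings from the slide's example
print(levenshtein("agtcc", "cgctca"))
```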
40 Edit Distance and Alignment
- Two strings, v and w
- Gaps are allowed in either string, except that two gaps are not allowed at the same character position
- Each character in an aligned string is represented by its position in the original string without gaps
A T - G T T A T -
A T C G T - A - G
v: (1 2 2 3 4 5 6 7 7)
w: (1 2 3 4 5 5 6 6 7)
- For both strings: (00) (11) (22) (23) (34) (45) (55) (66) (76) (77)
- This represents a path in a grid
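The correspondence between an alignment and a grid path can be sketched directly: each column advances the v coordinate, the w coordinate, or both. The function name `alignment_path` and the use of `-` as the gap symbol are assumptions made for this illustration.

```python
def alignment_path(row_v, row_w):
    """Convert two aligned rows into the list of grid vertices (i, j) they visit."""
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(row_v, row_w):
        if a != "-":
            i += 1          # consumed one character of v
        if b != "-":
            j += 1          # consumed one character of w
        path.append((i, j))
    return path


# The alignment shown on the slide
print(alignment_path("AT-GTTAT-", "ATCGT-A-G"))
```

A diagonal step corresponds to a match or mismatch column, a step that changes only i to a gap in w, and a step that changes only j to a gap in v.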
41 Edit Distance
- Vertex (i,j) corresponds to (ij) for $(v_i, w_j)$
- G = (V, E)
- Longest Path Problem
$$ s_v = \max_{u \in \text{Predecessor}(v)} \left( s_u + \text{weight of the edge from } u \text{ to } v \right) $$
- The predecessor relationship has to be established ahead of time
42 Global Alignment
- A string is a sequence of characters drawn from an alphabet A of size k
- Scoring matrix δ of size (k+1)x(k+1)
Problem statement
- Given two strings, v and w, and a scoring matrix δ, find the longest (maximum-score) path
Dynamic programming kernel: recurrence relationship
$$ s_{i,j} = \max \begin{cases} s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \\ s_{i-1,j-1} + \delta(v_i, w_j) \end{cases} $$
43 Global Alignment
Example of scoring matrix
- Match: +1; mismatch: -μ; indel: -σ
$$ s_{i,j} = \max \begin{cases} s_{i-1,j} - \sigma \\ s_{i,j-1} - \sigma \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \\ s_{i-1,j-1} - \mu, & \text{otherwise} \end{cases} $$
- Indels are frequent, and gap penalties proportional to indel size are considered too severe
- Affine gap penalties soften the penalty rate
- Can be linear: -(a + bx) for an indel of length x
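The recurrence with match +1, mismatch -μ, and indel -σ can be sketched as follows. This is an illustrative implementation in the style of Needleman and Wunsch that returns only the optimal score; the function name `global_alignment_score` and the default values μ = 1 and σ = 1 are assumptions, and it uses a simple per-residue indel penalty rather than affine gaps.

```python
def global_alignment_score(v, w, mu=1, sigma=1):
    """Optimal global alignment score: match +1, mismatch -mu, indel -sigma."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -sigma * i             # v[:i] aligned against gaps only
    for j in range(1, m + 1):
        s[0][j] = -sigma * j             # w[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(s[i - 1][j] - sigma,   # gap in w
                          s[i][j - 1] - sigma,   # gap in v
                          diagonal)              # match or mismatch
    return s[n][m]


print(global_alignment_score("ACGT", "AGT"))
```

The table has (n+1)(m+1) cells and each is filled in constant time, which is where the O(nm) cost comes from.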
49 Local Alignment
- Global sequence alignment is useful for aligning sequences from the same protein family, for example
- In biological applications, only substrings of the two sequences may be highly conserved
- Temple Smith and Michael Waterman, 1981
- In a global alignment, biologically irrelevant diagonal matches are likely to have a higher score
50 Local Alignment Problem
- Given two strings v and w, and a scoring matrix δ
- Find substrings of v and w whose global alignment is maximal among all substrings of v and w
- Seemingly harder, because the global alignment is to find the longest path from (0,0) to (n,m), whereas the local alignment is to find the longest path among all paths between two arbitrary points, (i,j) to (i’,j’)
- Add edges of weight 0 from (0,0) to every other vertex (vertex (0,0) is a predecessor of every vertex)
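51 Local Alignment Solution
- Recurrence kernel becomes
$$ s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \\ s_{i-1,j-1} + \delta(v_i, w_j) \end{cases} $$
- The 0 term comes from the zero-weight edge from (0,0) to every vertex
- Select the largest $s_{i,j}$
- Other non-maximal local alignments may have biological significance
- Select the k best nonoverlapping local alignments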
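A local alignment score in the style of Smith and Waterman can be sketched by allowing every cell to restart at 0 and keeping the largest cell value. The function name `local_alignment_score` and the scoring values (match +1, mismatch -1, indel -1) are assumptions for illustration; a full implementation would also trace back from the best cell to recover the aligned substrings.

```python
def local_alignment_score(v, w, mu=1, sigma=1):
    """Best local alignment score: match +1, mismatch -mu, indel -sigma."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]   # borders stay at 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(0,                     # restart via the free edge from (0,0)
                          s[i - 1][j] - sigma,   # gap in w
                          s[i][j - 1] - sigma,   # gap in v
                          diagonal)              # match or mismatch
            best = max(best, s[i][j])            # the alignment may end anywhere
    return best


# Only the GGG run is shared between these two made-up strings
print(local_alignment_score("AAAGGG", "TTTGGGCCC"))
```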