Published by Candice McDowell. Modified over 7 years ago.
1
Dynamic programming: one algorithmic key to many biological locks
Mikhail Gelfand, RTCB, IITP, RAS and FBB, MSU
2
BIOINFORMATICS FOR BIOLOGISTS Pavel Pevzner and Ron Shamir, eds.
(Cambridge University Press, 2011) Ch. 4. DYNAMIC PROGRAMMING: ONE ALGORITHMIC KEY FOR MANY BIOLOGICAL LOCKS Mikhail Gelfand Research and Training Center “Bioinformatics” of the Institute for Information Transmission Problems, RAS and Faculty of Bioengineering and Bioinformatics, M.V.Lomonosov Moscow State University
3
Alignment Three (of many) alignments of two sequences.
A plus denotes a match; a dot, a mismatch; a minus, a gap. (a) Two matches, five mismatches; (b) three matches, one mismatch, two gaps of size three (six indels, that is, one-nucleotide insertions/deletions); (c) four matches, two gaps of size three (six indels).
4
The number of alignments is large
Number of alignments of two sequences of length N: # ~ $(1+\sqrt{2})^{2N+1} / \sqrt{N}$.
At N = 1000, # far exceeds the number of elementary particles in the Universe (≈ $10^{80}$).
At N = 100, # ≈ $10^{76}$: assuming 1 operation per alignment and $10^{12}$ operations per second, we need ≈ $10^{57}$ years => we cannot consider them one by one.
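The estimate above is easy to evaluate on a logarithmic scale. The following is a minimal sketch; the function name `log10_alignment_count` is introduced here for illustration only and is not part of the slides.

```python
import math

def log10_alignment_count(n: int) -> float:
    """log10 of the estimate (1 + sqrt(2))**(2n + 1) / sqrt(n)."""
    return (2 * n + 1) * math.log10(1 + math.sqrt(2)) - 0.5 * math.log10(n)

# Work with logarithms: the numbers themselves are too large to be useful.
for n in (100, 1000):
    print(f"N = {n}: about 10^{log10_alignment_count(n):.1f} alignments")
```

Working with logarithms avoids handling astronomically large values directly; only the order of magnitude matters for the argument.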
5
Gene recognition
Segmentation of a genomic fragment into protein-coding and non-coding regions, based on differences in the statistical properties of these regions.
Difficult in eukaryotes due to the existence of introns, non-coding regions within genes.
6
Toy example
How many operations are needed to calculate
$\sum_{i=1}^{m} \sum_{j=1}^{n} x_i y_j = x_1 y_1 + x_1 y_2 + \dots + x_1 y_n + x_2 y_1 + x_2 y_2 + \dots + x_2 y_n + \dots + x_m y_1 + x_m y_2 + \dots + x_m y_n$ ?
Naïve answer: mn multiplications and mn–1 additions.
7
But rewrite it as
$(x_1 + x_2 + \dots + x_m) \cdot (y_1 + y_2 + \dots + y_n) = \sum_{i=1}^{m} x_i \cdot \sum_{j=1}^{n} y_j$
and it becomes m+n–2 additions and just 1 multiplication.
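A minimal sketch comparing the two orders of calculation; the function names and the sample numbers are chosen here for illustration.

```python
def naive_sum(xs, ys):
    # One multiplication per pair (x, y): m*n multiplications in total.
    total = 0
    for x in xs:
        for y in ys:
            total += x * y
    return total

def factored_sum(xs, ys):
    # m + n - 2 additions inside the two sums, then a single multiplication.
    return sum(xs) * sum(ys)

xs, ys = [1, 2, 3], [4, 5]
print(naive_sum(xs, ys), factored_sum(xs, ys))
```

Both functions return the same value; only the number of operations differs.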
8
Quiz How many multiplications do we need to calculate
$\prod_{i=1}^{m} \prod_{j=1}^{n} x_i y_j = x_1 y_1 \cdot x_1 y_2 \cdot \ldots \cdot x_1 y_n \cdot x_2 y_1 \cdot x_2 y_2 \cdot \ldots \cdot x_2 y_n \cdot \ldots \cdot x_m y_1 \cdot x_m y_2 \cdot \ldots \cdot x_m y_n$
(a) if we are naïve? (b) if we are sophisticated? (c) What if, in addition to multiplication, we have an operation “taking to the power”? (d) What if we may perform not only multiplication, but also addition?
9
Lesson Restructuring the order of calculations using properties of the data may sharply decrease the number of operations
10
Graphs
Vertices (nodes): $v_1, v_2, \dots, v_n$.
Arcs (edges): directed pairs of vertices, $a_m(v_i, v_j)$.
[Figure: example graph with multiple sources and sinks that contains cycles.]
11
“bad” graphs and not graphs
[Figure: examples labeled "multiple arcs", "multiple components", "loop", "not a graph (hanging arc)", and "undirected graph".]
12
Sources, sinks, paths, cycles
A source is a vertex that is not an end vertex for any arc. A sink is a vertex that is not a start vertex for any arc. A walk w of length N is an ordered set of N arcs $w = (a_1, \dots, a_N)$ such that the end vertex of arc $a_n = (b_n, e_n)$ coincides with the start vertex of arc $a_{n+1}$, that is, $e_n = b_{n+1}$, for all n = 1, …, N–1.
[Figure: three example graphs (one source and one sink; multiple sources and sinks; no source and sink) with example walks w = (a(v2,v1)); w = (a(v4,v5), a(v5,v3)); w = (a(v1,v3), a(v3,v2), a(v2,v4), a(v4,v3), a(v3,v1), a(v1,v3)).]
13
Sources, sinks, paths, cycles
In a graph without loops and multiple arcs, each walk may also be defined as an ordered set of vertices $w = (v_1, \dots, v_{N+1})$ such that for each pair of adjacent vertices $v_n, v_{n+1}$ there is an arc $a_n = (v_n, v_{n+1})$, n = 1, …, N.
[Figure: the same three example graphs (one source and one sink; multiple sources and sinks; no source and sink) with example walks w = (v2,v1); w = (v4,v5,v3); w = (v1,v3,v2,v4,v3,v1,v3).]
14
Sources, sinks, paths, cycles
A path is a walk in which no arc is passed twice. A cycle is a path in which the end vertex of the last arc $a_N$ coincides with the start vertex of the first arc $a_1$, that is, $e_N = b_1$. An acyclic graph contains no cycles.
[Figure: the same three example graphs. One source and one sink: acyclic graph, p = (v2,v1). Multiple sources and sinks: acyclic graph, p = (v4,v5,v3). No source and sink: cyclic graph, p = c = (v1,v3,v2,v4,v3,v1).]
15
Quiz (a) Draw all acyclic connected oriented graphs with three vertices (up to vertex labels). (b) How many oriented graphs will there be if we label vertices with symbols A, B and C? (c) Prove that in an acyclic graph there is at least one source and at least one sink. (d) Draw sinks and sources in the graphs of (a).
16
Problem Consider an acyclic graph with one source and one sink. Assign each arc a number called a weight. For a given path, its path score is defined as the sum of the weights of its arcs. Given a weighted acyclic graph, find the highest scoring path from the source to the sink.
17
Observation If two subpaths P and Q end at the same vertex v, and the score of P is larger than the score of Q, then for all pairs of paths P* and Q* that start with P and Q, respectively, and coincide after v, the score of P* is higher than the score of Q*. Hence, we do not need to consider all paths, as it is sufficient to construct the highest scoring subpath from the source to each vertex, finishing at the sink.
[Figure: subpaths P and Q ending at vertex v and their common continuation; P > Q implies P* > Q*.]
18
Let’s do it for this graph
[Figure: example acyclic graph with weighted arcs.]
19
[Figure: the example graph, Steps 1 and 2 of the forward process.]
20
[Figure: the example graph, Steps 3 and 4.]
21
[Figure: the example graph, Steps 5 and 6.]
22
[Figure: the example graph, Steps 7 and 8.]
23
[Figure: the example graph, Step 9 and backtracing; the highest path score is 20.]
24
Quiz At what steps did we have more than one vertex with all incoming arcs processed?
25
Algorithm
Data types and definitions:
  vertices: v, u, Source, Sink;
  arcs: (v,u), a;
  start vertex of arc a: Beginning_vertex(a);
  weight of arc (v,u): W(v,u);
  path: BestPath; // defined as a set of arcs
  the highest score of a subpath ending at v: Score(v);
  the highest score of a subpath coming through (v,u) and ending at u: Top_score(v,u);
  the last arc of the highest scoring subpath ending at u: Last_arc(u).
26
Initialize:
  for each vertex v: Score(v) := minus_infinity;
  Score(Source) := 0. // the empty subpath ending at the source
Forward process:
  while there are unprocessed vertices:
    v := arbitrary unprocessed vertex with all incoming arcs processed;
    for each arc (v,u): // consider all arcs starting at v
      Top_score(v,u) := Score(v) + W(v,u);
      if Top_score(v,u) > Score(u) // subpath coming through v is better than the current best subpath ending at u
      then: // update the data for u
        Score(u) := Top_score(v,u);
        Last_arc(u) := (v,u);
      endif;
      mark (v,u) as processed;
    endfor;
    mark v as processed;
  endwhile.
Backtracing:
  BestPath := empty_set; // initialize
  v := Sink; // go from the sink backwards by marked arcs
  until v = Source:
    add Last_arc(v) to BestPath; // add the last arc of the best path ending at the current vertex
    v := Beginning_vertex(Last_arc(v)); // go to the start vertex of this arc
  enduntil.
Output BestPath.
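The forward process and backtracing can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's code: the graph is passed as a dictionary mapping arcs `(v, u)` to weights, the function name `best_path` is hypothetical, and the small four-vertex graph is invented for the demonstration (it is not the graph from the slides).

```python
from collections import defaultdict

def best_path(arcs, source, sink):
    """Highest scoring source-to-sink path in an acyclic graph.

    arcs: dict mapping (v, u) -> weight. Returns (score, list of arcs).
    """
    outgoing = defaultdict(list)
    unprocessed_in = defaultdict(int)   # number of unprocessed incoming arcs
    vertices = set()
    for (v, u), w in arcs.items():
        outgoing[v].append((u, w))
        unprocessed_in[u] += 1
        vertices.update((v, u))

    score = {v: float("-inf") for v in vertices}
    score[source] = 0                   # empty subpath ending at the source
    last_arc = {}

    # Vertices whose incoming arcs have all been processed.
    ready = [v for v in vertices if unprocessed_in[v] == 0]
    while ready:
        v = ready.pop()
        for u, w in outgoing[v]:
            top_score = score[v] + w
            if top_score > score[u]:    # subpath through v is better
                score[u] = top_score
                last_arc[u] = (v, u)
            unprocessed_in[u] -= 1
            if unprocessed_in[u] == 0:
                ready.append(u)

    # Backtracing from the sink along the stored last arcs.
    path = []
    v = sink
    while v != source:
        path.append(last_arc[v])
        v = last_arc[v][0]
    path.reverse()
    return score[sink], path

# Hypothetical toy graph, not the one shown on the slides.
demo_arcs = {("s", "a"): 2, ("s", "b"): 5, ("a", "b"): 1,
             ("a", "t"): 10, ("b", "t"): 1}
print(best_path(demo_arcs, "s", "t"))
```

Each arc is examined exactly once in the forward process, which matches the linear operation count discussed on the next slide.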
27
The number of operations
The limiting procedure is processing vertices and adding arcs to paths, and we consider each arc only once. Hence the number of operations is linear in the number of arcs A: the run time of the algorithm is O(A).
28
Greedy algorithm Start at the source and select the highest-weighted arc at each step. It does not work: on the example graph the greedy path scores 13 < 20.
[Figure: the example graph with the greedy path.]
29
Quiz (a) Construct the simplest possible graph in which the greedy algorithm yields the highest scoring path. (b) Construct a graph with three vertices in which the greedy algorithm does not yield the highest scoring path. (c) Construct a graph with three vertices in which the greedy algorithm does yield the highest scoring path. (d) Assign new weights to the arcs of the above graph so that the greedy algorithm will yield the highest scoring path.
30
Quiz cont’d (e) Write an algorithm for construction of the path with the maximum number of arcs and apply it to the above graph. Hint: do not change the algorithm, set proper arc weights. (f) Modify the maximum score algorithm so as to construct the path with the minimal score and find this path for the above graph. (g) Provide a greedy algorithm for finding the path of minimal score in a graph, and apply it to the above graph. (h) For the above graph, find the path with the minimal number of arcs.
31
Lesson The generic dynamic programming algorithm may be applied to different problems. The common feature of these problems is that each one can be decomposed into an ordered set of smaller subproblems, and to solve a more complex subproblem one needs to know only the solutions of the simpler ones, but not the entire set of possibilities.
32
Note There exist path optimization problems that cannot be solved by dynamic programming. Traveling salesman problem: given a non-oriented graph with weighted arcs, we need to construct the lowest scoring path passing through all the vertices (the salesman needs to visit all cities, with travel time between the cities given by the arc weights, while spending the least amount of time traveling). All cities need to be visited in a single trip => NP-complete problem. No efficient algorithms are known. Most computer scientists believe that for all NP-complete problems the number of operations required to provide an optimal solution is exponential in the problem size.
33
Alignment Given two symbol sequences (nucleotides or amino acids) of lengths M and N, set a correspondence between these sequences so that some symbols are set in pairs, matching or mismatching, whereas other symbols are ignored (indels). The order of corresponding symbols in the subsequences should coincide. The alignment score is the sum of match premiums r per matching pair minus the sum of mismatch penalties p per mismatching pair and deletion penalties q per ignored symbol. The goal is to construct the highest scoring alignment.
34
Quiz What are the scores of the alignments
35
Reduction to the optimal path problem
Construct a graph. Vertices correspond to pairs of positions (endpoints of partial alignments). Outgoing arcs (for each vertex) are of three types:
match (weight r) or mismatch (weight (–p)); total M∙N arcs;
deletion in the 1st sequence (weight (–q)); total M∙(N+1) arcs;
deletion in the 2nd sequence (weight (–q)); total (M+1)∙N arcs.
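Because the alignment graph is a regular grid, the best-path scores can be stored in a table indexed by the pair of positions. The sketch below assumes the scoring scheme defined above (match premium r, mismatch penalty p, deletion penalty q); the function name `align_score` and the default values p = 4 and q = 3 are assumptions made for illustration.

```python
def align_score(s, t, r=10, p=4, q=3):
    """Highest alignment score: +r per match, -p per mismatch, -q per indel."""
    m, n = len(s), len(t)
    # score[i][j]: best score of a path from the source (0, 0) to vertex (i, j),
    # that is, of aligning the prefixes s[:i] and t[:j].
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = -q * i            # only deletion arcs reach this vertex
    for j in range(1, n + 1):
        score[0][j] = -q * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = score[i - 1][j - 1] + (r if s[i - 1] == t[j - 1] else -p)
            score[i][j] = max(diagonal,
                              score[i - 1][j] - q,   # symbol of s left unpaired
                              score[i][j - 1] - q)   # symbol of t left unpaired
    return score[m][n]

print(align_score("ACGT", "ACGT"))
```

Filling the table row by row guarantees that all incoming arcs of a vertex are processed before the vertex itself, as the general algorithm requires.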
36
Alignment graph
[Figure: alignment graph; the axes are labeled with the letters g, e, l, a, f, n, d.]
37
Alignment graph with weights
[Figure: alignment graph with arc weights (q, p shown); the axes are labeled with the letters g, e, l, a, f, n, d.]
38
Paths for the three alignments
39
Variants
Hanging-end alignment (genome assembly): zero-weight arcs from the source to the top and left “perimeter” and from the right and bottom perimeter to the sink.
Local alignment: zero-weight arcs from the source to all internal vertices and from internal vertices to the sink.
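For local alignment, the zero-weight arcs from the source mean that any vertex may start with score 0, and the zero-weight arcs to the sink mean that the answer is the best score over all vertices. A minimal sketch under the same scoring assumptions as before (the name `local_align_score` and the defaults p = 4, q = 3 are illustrative):

```python
def local_align_score(s, t, r=10, p=4, q=3):
    """Best score of a local alignment of s and t."""
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diagonal = prev[j - 1] + (r if s[i - 1] == t[j - 1] else -p)
            # The 0 is the zero-weight arc coming directly from the source.
            cur[j] = max(0, diagonal, prev[j] - q, cur[j - 1] - q)
            # Zero-weight arc to the sink: any vertex may end the alignment.
            best = max(best, cur[j])
        prev = cur
    return best

print(local_align_score("TTTACGTTT", "GGACGGG"))
```

Only the arc set changed compared with the global case; the forward process itself is the same.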
40
Weights
Amino-acid substitution weight matrices:
evolutionary PAM (sure alignment of closely related proteins, take matrix to the power);
BLOSUM (alignment of conserved regions in distantly related proteins);
based on physical and chemical properties of residues.
Deletion penalty: affine penalties (opening and extension penalties).
Structural alignment as the gold standard.
41
Quiz For the above alignments, assuming match premium r=10, what combinations of mismatch and deletion penalties would yield optimal alignments (a), (b), and (c)?
42
Multiple alignment
Triple alignment: cubic graph, etc.
For K sequences of length N, this requires $O(N^K)$ operations and soon becomes unworkable.
Progressive alignment: all pairwise alignments, distance matrices; guide tree; alignment of partial alignments.
43
Lesson Weights matter. The same graph with differently assigned arc weights will yield different types of alignment.
44
Gene recognition Define a gene as a sequence fragment consisting of exons and introns. The boundaries between them are donor sites (between exons and introns, usually GT) and acceptor sites (between introns and exons, usually AG). Each exon and intron is assigned a weight, measuring coding affinity (respectively, non-coding affinity) of its sequence. The gene’s score is the sum of weights of constituent exons and introns. The goal is, given a sequence and a set of candidate donor and acceptor sites, construct the highest-scoring exon–intron structure for a gene.
45
Construct a graph
actgagactgcagacggacgtacggcactgacgtataagccccacagtccttacgtctga
actgagactgcagACGGACGTACGGCACTGACgtataagCCCCACAGTCCTTACgtctga
[Figure: panels (a) and (b).]
46
Complexity Assume even distribution of sites (leave out details) => O(L) vertices, $O(L^2)$ arcs. Can we do better?
47
It makes sense to assume that the segment weights are additive (we assume that for exons anyhow). Then we have just O(L) arcs.
actgagactgcagacggacgtacggcactgacgtataagccccacagtccttacgtctga
actgagactgcagACGGACGTACGGCACTGACgtataagCCCCACAGTCCTTACgtctga
[Figure: panels (a) and (b).]
48
Quiz There are two paths in the segment graph that describe exon–intron structures not represented in the exon–intron graph. What are they? What arcs need to be added to the exon–intron graph to represent these structures?
49
Lesson Structure matters. The same problem may be represented by different graphs, and the conceptually simplest representation is not necessarily the most efficient one.
50
Return to the toy problem
With the roles of addition and multiplication interchanged, the standard trick would not work, because x∙z + y∙z = (x + y)∙z (used before) holds, but (x + z)∙(y + z) = x∙y + z generally does not. Quiz. When does (x + z)∙(y + z) = x∙y + z hold?
51
DP, generic statement. 1. Path weights
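Let ⊗ be the operation of calculating the path score S given arc weights W. We require that the associative rule hold: $(a \otimes b) \otimes c = a \otimes (b \otimes c)$. Hence we can simply write $a \otimes b \otimes c$. The path weight (formerly $S(P) = \sum_{a \in P} W(a)$) becomes $S(P) = \bigotimes_{a \in P} W(a)$.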
52
DP, generic statement. 2. Graph score
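Let Ψ be the set of all paths and ⊕ the operation of selecting the path. We require that ⊕ obey the associative and commutative rules for combining paths: $(x \oplus y) \oplus z = x \oplus (y \oplus z)$ and $x \oplus y = y \oplus x$. The graph score is defined as $\bigoplus_{P \in \Psi} S(P)$ (for the optimal path problem, ⊕ = max).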
53
DP, generic statement. 3. Transitivity
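To use dynamic programming, we need the distribution law of the path-score operation ⊗ over the path-selection operation ⊕: $(x \oplus y) \otimes z = (x \otimes z) \oplus (y \otimes z)$ and $z \otimes (x \oplus y) = (z \otimes x) \oplus (z \otimes y)$. This is a generalization of the property used for calculating the optimal path: max (x + z, y + z) = max (x, y) + z.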
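The generic statement can be sketched as one function that takes the two operations as parameters. Everything named here is an assumption of this sketch: `otimes` combines weights along a path, `oplus` combines the scores of alternative paths, `one` is the neutral element of `otimes`, and the toy graph is invented. The graph is assumed to be acyclic with a single source.

```python
import operator
from collections import defaultdict

def graph_score(arcs, source, sink, otimes, oplus, one):
    """Combine weights along paths with otimes and across paths with oplus.

    arcs: dict (v, u) -> weight of an acyclic graph with a single source.
    """
    outgoing = defaultdict(list)
    unprocessed_in = defaultdict(int)
    for (v, u), w in arcs.items():
        outgoing[v].append((u, w))
        unprocessed_in[u] += 1

    score = {source: one}               # score of the empty path
    ready = [source]
    while ready:
        v = ready.pop()
        for u, w in outgoing[v]:
            through_v = otimes(score[v], w)
            # The distribution law is what allows merging partial results here.
            score[u] = oplus(score[u], through_v) if u in score else through_v
            unprocessed_in[u] -= 1
            if unprocessed_in[u] == 0:
                ready.append(u)
    return score[sink]

# Hypothetical toy graph with three source-to-sink paths.
demo_arcs = {("s", "a"): 2, ("s", "b"): 5, ("a", "b"): 1,
             ("a", "t"): 10, ("b", "t"): 1}

# Optimal path problem: otimes is +, oplus is max.
print(graph_score(demo_arcs, "s", "t", operator.add, max, 0))
# Sum over all paths of the product of arc weights: otimes is *, oplus is +.
print(graph_score(demo_arcs, "s", "t", operator.mul, operator.add, 1))
```

The same traversal serves both calls; only the pair of operations changes.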
54
DP, algorithm
55
Problem (physics of polymers)
Linear polymer chain of L+1 monomers k = 0, …, L.
Each monomer assumes one of N states, $\sigma(k) \in \{\sigma_i \mid i = 1, \dots, N\}$.
The energy of interactions between adjacent monomers is defined by an N×N matrix $\xi(\sigma_i, \sigma_j)$ (measured in kT units).
A chain conformation P is defined by the states of the monomers {σ(0), σ(1), …, σ(L)}.
Exponent of energy: $S(P) = \exp(-E(P)) = \prod_{k=1}^{L} \exp(-\xi(\sigma(k-1), \sigma(k)))$.
Ψ is the set of all conformations.
Calculate the partition function of the set of all conformations, $\Omega = \sum_{P \in \Psi} S(P)$.
56
Graph construction and reduction to DP
Vertices correspond to monomer states, so that their number is (L+1)∙N+2 (two additional vertices are the source and the sink, corresponding to the virtual start and end of the chain). Arcs link vertices corresponding to adjacent monomers. Arc weights are the exponents of the interaction energies, exp (–ξ). Paths through this graph exactly correspond to the chain conformations. The path-score operation ⊗ is ordinary multiplication, and the path-combining operation ⊕ is addition. The path score is the product of arc weights. The total graph score is the sum of these products. Standard DP solves the problem.
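A minimal sketch of this reduction, compared with direct enumeration of all conformations. The function names and the example matrix are assumptions for illustration; `xi[i][j]` holds the interaction energy between adjacent states i and j.

```python
import itertools
import math

def partition_function(xi, L):
    """Sum of exp(-E(P)) over all conformations of a chain of L + 1 monomers."""
    n = len(xi)
    # z[j]: summed weight of all partial chains whose last monomer is in state j.
    z = [1.0] * n
    for _ in range(L):
        z = [sum(z[i] * math.exp(-xi[i][j]) for i in range(n)) for j in range(n)]
    return sum(z)

def partition_function_direct(xi, L):
    """The same sum, computed by enumerating every conformation."""
    n = len(xi)
    total = 0.0
    for states in itertools.product(range(n), repeat=L + 1):
        energy = sum(xi[a][b] for a, b in zip(states, states[1:]))
        total += math.exp(-energy)
    return total

xi_demo = [[0.0, 1.0], [0.5, 0.2]]      # hypothetical 2-state energy matrix
print(partition_function(xi_demo, 5), partition_function_direct(xi_demo, 5))
```

The two functions agree up to floating-point rounding, but the direct version visits every conformation separately.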
57
Quiz (a) How many operations shall we need?
(b) How many operations shall we need if we calculate the partition function directly? (c) Provide an algorithm for calculating the number of paths in a graph. Hint: invent suitable arc weights and reduce to the previous problem. (d) What will Ω be if both the path-score operation ⊗ and the path-combining operation ⊕ are the operation of taking the maximum?
58
Problem Calculate the minimum energy and the number of conformations with the minimum energy. Arc weights are pairs [1, ξ], with ξ as defined previously. Path scores are pairs [n, ε], where ε is the energy, and n is the number of conformations having this energy. When two systems are combined, the resulting energy is the sum of the systems’ energies, whereas the number of states is the product of the numbers of states. Hence DP with $[n_1, \varepsilon_1] \otimes [n_2, \varepsilon_2] = [n_1 n_2, \varepsilon_1 + \varepsilon_2]$, and with ⊕ keeping the pair of lower energy (adding the counts when the energies are equal), solves the problem.
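A minimal sketch of DP over pairs (count, energy). The function names are hypothetical; the rule for combining alternatives (keep the lower energy, add the counts on a tie) is an assumption that follows from the problem statement. Integer energies are used in the example so that ties can be tested exactly.

```python
def otimes(a, b):
    # Combining two systems: counts multiply, energies add.
    return (a[0] * b[0], a[1] + b[1])

def oplus(a, b):
    # Selecting among alternatives: keep the lower energy; on a tie, add counts.
    if a[1] < b[1]:
        return a
    if b[1] < a[1]:
        return b
    return (a[0] + b[0], a[1])

def min_energy_and_count(xi, L):
    """(number of minimum-energy conformations, minimum energy) for L + 1 monomers."""
    n = len(xi)
    best = [(1, 0)] * n                 # one empty-energy chain per start state
    for _ in range(L):
        new_best = []
        for j in range(n):
            acc = None
            for i in range(n):
                candidate = otimes(best[i], (1, xi[i][j]))
                acc = candidate if acc is None else oplus(acc, candidate)
            new_best.append(acc)
        best = new_best
    result = best[0]
    for pair in best[1:]:
        result = oplus(result, pair)
    return result

# Hypothetical matrix: equal neighbours cost 0, different neighbours cost 1.
print(min_energy_and_count([[0, 1], [1, 0]], 3))
```

With floating-point energies the equality test in `oplus` would need a tolerance; that detail is left out of this sketch.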
59
Lesson Generalizations are useful
60
Note Not all problems that can be solved by dynamic programming have a simple graph representation. For example, reconstruction of the secondary structure of an RNA molecule given its sequence can be decomposed into simpler, embedded problems and can be solved by a variant of the dynamic programming algorithm, but in the language of this paragraph it requires slightly more complicated objects called hypergraphs.
61
Thanks: Mikhail Roytberg, Andrei Mironov, Anatoly Rubinov, Pavel Pevzner