CSE 5290: Algorithms for Bioinformatics Fall 2011

Suprakash Datta Office: CSEB 3043 Phone: ext 77875 Course page: 9/17/2018 CSE 5290, Fall 2011

Last time: Dynamic programming algorithms for sequence alignment (global, local)
Next: Divide and conquer algorithms
The following slides are based on slides by the authors of our text.

Divide and Conquer Algorithms
Steps:
Divide the problem into sub-problems.
Conquer by solving the sub-problems recursively. If the sub-problems are small enough, solve them by brute force.
Combine the solutions of the sub-problems into a solution of the original problem (the tricky part).

Examples of divide-and-conquer
Merge sort
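The three steps can be illustrated with merge sort. The following is a minimal sketch, not taken from the slides; the function name `merge_sort` is chosen here for illustration.

```python
def merge_sort(items):
    """Return a new sorted list using divide and conquer."""
    # Small enough: a list of length 0 or 1 is already sorted.
    if len(items) <= 1:
        return list(items)

    # Divide: split the list into two halves.
    mid = len(items) // 2

    # Conquer: sort each half recursively.
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Combine: merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

Here the combine step (the merge) does all the real work, which matches the remark that combining is the tricky part.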

Divide and Conquer Approach to LCS
Path(source, sink)
  if source and sink are in consecutive columns
    output the longest path from source to sink
  else
    middle ← middle vertex between source and sink
    Path(source, middle)
    Path(middle, sink)
The only problem left is how to find this “middle vertex”!

Computing Alignment Path Requires Quadratic Memory
The space complexity of computing an alignment path for sequences of length n and m is O(nm).
We need to keep all backtracking references in memory to reconstruct the path (backtracking).
[Figure: n x m alignment grid]

Computing Alignment Score with Linear Memory
The space complexity of computing just the score itself is O(n).
We only need the previous column to calculate the current column, and we can throw away the previous column once we are done using it.
[Figure: two columns of n scores each]

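Recall: Computing LCS
Let $v_i$ = prefix of v of length i: $v_1 \dots v_i$
and $w_j$ = prefix of w of length j: $w_1 \dots w_j$
The length of LCS($v_i$, $w_j$) is computed by:
$$s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1 & \text{if } v_i = w_j \end{cases}$$
(In the condition, $v_i = w_j$ compares the i-th character of v with the j-th character of w.)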

Computing Alignment Score: Recycling Columns
Only two columns of scores are saved at any given time:
memory for column 1 is used to calculate column 3
memory for column 2 is used to calculate column 4
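The column-recycling idea can be sketched as follows for the LCS score. This is an illustrative sketch rather than the course's own code; the name `lcs_length` is an assumption, and rows are indexed by v and columns by w.

```python
def lcs_length(v, w):
    """Length of the longest common subsequence of v and w in O(len(v)) space."""
    n = len(v)
    # prev holds column j-1 of the score matrix: prev[i] = s[i][j-1].
    prev = [0] * (n + 1)
    for wj in w:
        # curr holds column j; curr[0] = 0 is the boundary condition.
        curr = [0] * (n + 1)
        for i in range(1, n + 1):
            best = max(prev[i], curr[i - 1])      # s[i][j-1], s[i-1][j]
            if v[i - 1] == wj:
                best = max(best, prev[i - 1] + 1)  # s[i-1][j-1] + 1
            curr[i] = best
        # The old column is no longer needed once the new one is complete.
        prev = curr
    return prev[n]
```

Only the score is returned; the path itself cannot be recovered from two columns, which is what motivates the divide-and-conquer approach.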

Crossing the Middle Line
We want to calculate the longest path from (0,0) to (n,m) that passes through (i, m/2), where i ranges from 0 to n and represents the i-th row.
Define length(i) as the length of the longest path from (0,0) to (n,m) that passes through vertex (i, m/2).
[Figure: n x m grid with middle column m/2; the path through (i, m/2) splits into Prefix(i) and Suffix(i)]

Crossing the Middle Line
Define (mid, m/2) as the vertex where the longest path crosses the middle column.
length(mid) = optimal length = $\max_{0 \le i \le n}$ length(i)
[Figure: n x m grid with middle column m/2; the path through (i, m/2) splits into Prefix(i) and Suffix(i)]

Computing Prefix(i)
prefix(i) is the length of the longest path from (0,0) to (i, m/2).
Compute prefix(i) by dynamic programming in the left half of the matrix.
[Figure: the prefix(i) values are stored in column m/2]

Computing Suffix(i)
suffix(i) is the length of the longest path from (i, m/2) to (n,m).
Equivalently, suffix(i) is the length of the longest path from (n,m) to (i, m/2) with all edges reversed.
Compute suffix(i) by dynamic programming in the right half of the “reversed” matrix.
[Figure: the suffix(i) values are stored in column m/2]

Length(i) = Prefix(i) + Suffix(i)
Add prefix(i) and suffix(i) to compute length(i):
length(i) = prefix(i) + suffix(i)
The middle vertex (i, m/2) of the maximum path is the one whose row i maximizes length(i).
[Figure: middle point found in column m/2]
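A minimal sketch of the middle-vertex computation for LCS, assuming rows are indexed by v and columns by w. The helper names `lcs_last_column` and `middle_vertex` are hypothetical and not from the slides.

```python
def lcs_last_column(v, w):
    """Return the last column of LCS scores: result[i] = LCS length of v[:i] and w."""
    prev = [0] * (len(v) + 1)
    for wj in w:
        curr = [0] * (len(v) + 1)
        for i in range(1, len(v) + 1):
            best = max(prev[i], curr[i - 1])
            if v[i - 1] == wj:
                best = max(best, prev[i - 1] + 1)
            curr[i] = best
        prev = curr
    return prev


def middle_vertex(v, w):
    """Return (i, mid_col, length) for a row i maximizing prefix(i) + suffix(i)."""
    n, m = len(v), len(w)
    mid_col = m // 2
    # prefix[i]: longest path from (0,0) to (i, mid_col), using the left half of w.
    prefix = lcs_last_column(v, w[:mid_col])
    # Reversing both strings reverses all edges; rev[k] is the score for the
    # last k characters of v against the right half of w.
    rev = lcs_last_column(v[::-1], w[mid_col:][::-1])
    suffix = [rev[n - i] for i in range(n + 1)]
    length = [prefix[i] + suffix[i] for i in range(n + 1)]
    best_i = max(range(n + 1), key=length.__getitem__)
    return best_i, mid_col, length[best_i]
```

Both halves use only two columns of memory at a time, so the middle vertex is found in linear space.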

Finding the Middle Point
[Figure: middle points in the subdivided alignment grid]

Finding the Middle Point again
[Figure: middle points in the further subdivided alignment grid]

And Again
[Figure: alignment grid with columns marked 0, m/8, m/4, 3m/8, m/2, 5m/8, 3m/4, 7m/8, m]

Time = Area: First Pass
On the first pass, the algorithm covers the entire area.
Area = nm
[Figure: computing prefix(i) in the left half and suffix(i) in the right half]

Time = Area: Second Pass
On the second pass, the algorithm covers only 1/2 of the area.
Area/2

Time = Area: Third Pass
On the third pass, only 1/4 of the area is covered.
Area/4

Geometric Reduction At Each Iteration
$1 + \frac{1}{2} + \frac{1}{4} + \dots + \left(\frac{1}{2}\right)^k \le 2$
Runtime: O(Area) = O(nm)
[Figure: first pass: 1, 2nd pass: 1/2, 3rd pass: 1/4, 4th pass: 1/8, 5th pass: 1/16]

Is It Possible to Align Sequences in Subquadratic Time?
Dynamic programming takes $O(n^2)$ time for global alignment.
Can we do better?
Yes, use the Four-Russians speedup.

Partitioning Alignment Grid into Blocks

Block Alignment
Block alignment of sequences u and v:
An entire block in u is aligned with an entire block in v
An entire block is inserted
An entire block is deleted
Block path: a path that traverses every t x t square through its corners

Block Alignment: Examples
[Figure: a valid and an invalid block path]

Block Alignment Problem
Goal: Find the longest block path through an edit graph
Input: Two sequences, u and v, partitioned into blocks of size t. This is equivalent to an n x n edit graph partitioned into t x t subgrids
Output: The block alignment of u and v with the maximum score (the longest block path through the edit graph)

Constructing Alignments within Blocks
To solve: compute the alignment score $\beta_{i,j}$ for each pair of blocks $u_{(i-1)t+1} \dots u_{it}$ and $v_{(j-1)t+1} \dots v_{jt}$
How many blocks are there per sequence? (n/t) blocks of size t
How many pairs of blocks for aligning the two sequences? (n/t) x (n/t)
For each block pair, solve a mini-alignment problem of size t x t

Constructing Alignments within Blocks
Solve mini-alignment problems
[Figure: each small square represents a block pair]

Block Alignment: Dynamic Programming
Let $s_{i,j}$ denote the optimal block alignment score between the first i blocks of u and the first j blocks of v.
$\sigma_{block}$ is the penalty for inserting or deleting an entire block.
$\beta_{i,j}$ is the score of the pair of blocks in row i and column j.
$$s_{i,j} = \max \begin{cases} s_{i-1,j} - \sigma_{block} \\ s_{i,j-1} - \sigma_{block} \\ s_{i-1,j-1} + \beta_{i,j} \end{cases}$$

Block Alignment Runtime
Indices i, j range from 0 to n/t
The running time of the algorithm is $O([n/t] \cdot [n/t]) = O(n^2/t^2)$ if we don’t count the time to compute each $\beta_{i,j}$

Block Alignment Runtime (cont’d)
Computing all i,j requires solving (n/t)*(n/t) mini block alignments, each of size (t*t) So computing all i,j takes time O([n/t]*[n/t]*t*t) = O(n2) This is the same as dynamic programming How do we speed this up? 9/17/2018 CSE 5290, Fall 2011

Four Russians Technique
Let t = log(n), where t is the block size and n is the sequence size.
Instead of having (n/t)*(n/t) mini-alignments, construct $4^t \times 4^t$ mini-alignments for all pairs of strings of t nucleotides (huge size), and put them in a lookup table.
However, the size of the lookup table is not really that huge if t is small.
Let t = (log n)/4. Then $4^t \times 4^t = n$ (with logarithms base 2, $4^t = 2^{2t} = \sqrt{n}$).

Look-up Table for Four Russians Technique
[Figure: lookup table “Score” with rows and columns indexed by the strings AAAAAA, AAAAAC, AAAAAG, AAAAAT, AAAACA, …]
Each sequence has t nucleotides.
The table size is only n, instead of (n/t)*(n/t).

New Recurrence
The new lookup table Score is indexed by a pair of t-nucleotide strings, so
$$s_{i,j} = \max \begin{cases} s_{i-1,j} - \sigma_{block} \\ s_{i,j-1} - \sigma_{block} \\ s_{i-1,j-1} + \text{Score}(i\text{th block of } u,\ j\text{th block of } v) \end{cases}$$
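The recurrence with a precomputed lookup table can be sketched as below. This is an illustration under stated assumptions: the mini-alignment scoring (match +1, mismatch -1, indel -1), the parameter `sigma_block`, and all function names are chosen here and do not come from the slides. Both sequence lengths are assumed to be multiples of t.

```python
from itertools import product


def alignment_score(a, b, match=1, mismatch=-1, indel=-1):
    """Global alignment score of two short strings (assumed scoring scheme)."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + indel, curr[j - 1] + indel)
        prev = curr
    return prev[len(b)]


def build_score_table(t, alphabet="ACGT"):
    """Precompute Score for all 4^t x 4^t pairs of t-nucleotide strings."""
    blocks = ["".join(p) for p in product(alphabet, repeat=t)]
    return {(a, b): alignment_score(a, b) for a in blocks for b in blocks}


def block_alignment_score(u, v, t, sigma_block):
    """Optimal block alignment score using table lookups instead of mini-alignments."""
    if len(u) % t or len(v) % t:
        raise ValueError("sequence lengths must be multiples of t")
    score = build_score_table(t)
    ub = [u[k:k + t] for k in range(0, len(u), t)]
    vb = [v[k:k + t] for k in range(0, len(v), t)]
    s = [[0] * (len(vb) + 1) for _ in range(len(ub) + 1)]
    for i in range(1, len(ub) + 1):
        s[i][0] = s[i - 1][0] - sigma_block
    for j in range(1, len(vb) + 1):
        s[0][j] = s[0][j - 1] - sigma_block
    for i in range(1, len(ub) + 1):
        for j in range(1, len(vb) + 1):
            s[i][j] = max(
                s[i - 1][j] - sigma_block,
                s[i][j - 1] - sigma_block,
                s[i - 1][j - 1] + score[(ub[i - 1], vb[j - 1])],
            )
    return s[-1][-1]
```

The table is built once for a given t and could be reused across many alignments; it only pays off when t is small, since its size grows as 4^t x 4^t.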

Four Russians Speedup: Runtime
Since computing the lookup table Score of size n takes O(n) time, the running time is mainly limited by the (n/t)*(n/t) accesses to the lookup table
Each access takes O(log n) time
Overall running time: $O([n^2/t^2] \cdot \log n)$
Since t = log n, substitute in: $O([n^2/\log^2 n] \cdot \log n) = O(n^2/\log n)$

So Far…
We can divide up the grid into blocks and run dynamic programming only on the corners of these blocks
In order to speed up the mini-alignment calculations to under $n^2$, we create a lookup table of size n, which consists of all scores for all t-nucleotide pairs
Running time goes from quadratic, $O(n^2)$, to subquadratic: $O(n^2/\log n)$

Four Russians Speedup for LCS
Unlike the block partitioned graph, the LCS path does not have to pass through the vertices of the blocks.
[Figure: a block alignment path compared with a longest common subsequence path]

Block Alignment vs. LCS
In block alignment, we only care about the corners of the blocks.
In LCS, we care about all points on the edges of the blocks, because those are points that the path can traverse.
Recall, each sequence is of length n, each block is of size t, so each sequence has (n/t) blocks.

Block Alignment vs. LCS: Points Of Interest
Block alignment has $(n/t) \cdot (n/t) = n^2/t^2$ points of interest
LCS alignment has $O(n^2/t)$ points of interest

Traversing Blocks for LCS
Given alignment scores $s_{i,*}$ in the first row and scores $s_{*,j}$ in the first column of a t x t mini square, compute alignment scores in the last row and column of the mini square.
To compute the last row and the last column score, we use these 4 variables:
alignment scores $s_{i,*}$ in the first row
alignment scores $s_{*,j}$ in the first column
substring of sequence u in this block ($4^t$ possibilities)
substring of sequence v in this block ($4^t$ possibilities)

Traversing Blocks for LCS (cont’d)
If we used this to compute the grid, it would take quadratic, $O(n^2)$, time, but we want to do better.
[Figure: t x t block; we know the scores in the first row and column, and we can calculate the scores in the last row and column]

Four Russians Speedup
Build a lookup table for all possible values of the four variables:
all possible scores for the first row $s_{i,*}$
all possible scores for the first column $s_{*,j}$
substring of sequence u in this block ($4^t$ possibilities)
substring of sequence v in this block ($4^t$ possibilities)
For each quadruple we store the value of the score for the last row and last column.
This creates a huge table, but we can eliminate alignment scores that don’t make sense.

Reducing Table Size
Alignment scores in LCS are monotonically non-decreasing, and adjacent elements can’t differ by more than 1
Example: 0,1,2,2,3,4 is ok; 0,1,2,4,5,8 is not, because 2 and 4 differ by more than 1 (and so do 5 and 8)
Therefore, we only need to store quadruples whose scores are monotonically non-decreasing and differ by at most 1

Efficient Encoding of Alignment Scores
Instead of recording numbers that correspond to the index in the sequences u and v, we can use binary to encode the differences between the alignment scores
[Figure: a row of scores in the original encoding and the same row in the binary encoding]
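The difference encoding can be sketched as follows, using the valid and invalid score rows from the table-size example. The function names `encode_scores` and `decode_scores` are hypothetical.

```python
def encode_scores(scores):
    """Encode a row of LCS scores as (first score, list of 0/1 differences)."""
    bits = []
    for a, b in zip(scores, scores[1:]):
        step = b - a
        # Adjacent LCS scores may only stay the same or grow by exactly 1.
        if step not in (0, 1):
            raise ValueError(f"invalid step {step} between {a} and {b}")
        bits.append(step)
    return scores[0], bits


def decode_scores(first, bits):
    """Rebuild the row of scores from its first value and the difference bits."""
    scores = [first]
    for bit in bits:
        scores.append(scores[-1] + bit)
    return scores
```

A row of t + 1 scores is thus described by t bits plus its starting value, which is why only 2^t score patterns per row or column need to be tabulated.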

Reducing Lookup Table Size
$2^t$ possible scores (t = size of blocks)
$4^t$ possible strings
Lookup table size is $(2^t \cdot 2^t) \cdot (4^t \cdot 4^t) = 2^{6t}$
Let t = (log n)/4; the table size is: $2^{6(\log n)/4} = n^{6/4} = n^{3/2}$
Time = $O([n^2/t^2] \cdot \log n)$
$O([n^2/\log^2 n] \cdot \log n) = O(n^2/\log n)$

Summary
We take advantage of the fact that for each block of size t = log(n), we can pre-compute all possible scores and store them in a lookup table of size $n^{3/2}$
We used the Four Russians speedup to go from a quadratic running time for LCS to a subquadratic running time: $O(n^2/\log n)$

Next
Graph algorithms
Some of the following slides are based on slides by the authors of our text.

DNA Sequencing
Shear DNA into millions of small fragments
Read 500 – 700 nucleotides at a time from the small fragments (Sanger method)

Fragment Assembly
Computational challenge: assemble individual short fragments (reads) into a single genomic sequence (“superstring”)
Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem

Shortest Superstring Problem
Problem: Given a set of strings, find a shortest string that contains all of them
Input: Strings s1, s2, …, sn
Output: A string s that contains all strings s1, s2, …, sn as substrings, such that the length of s is minimized
Complexity: NP-complete
Note: this formulation does not take into account sequencing errors

Shortest Superstring Problem: Example

Reducing SSP to TSP
Define overlap(si, sj) as the length of the longest prefix of sj that matches a suffix of si.
What is overlap(si, sj) for these strings?
aaaggcatcaaatctaaaggcatcaaa
overlap = 12
Construct a graph with n vertices representing the n strings s1, s2, …, sn.
Insert edges of length overlap(si, sj) between vertices si and sj.
Find the path that visits every vertex exactly once and maximizes the total overlap (equivalently, the shortest such path when each edge has length -overlap(si, sj)). This is the Traveling Salesman Problem (TSP), which is also NP-complete.

Reducing SSP to TSP (cont’d)

SSP to TSP: An Example
S = { ATC, CCA, CAG, TCC, AGT }
[Figure: overlap graph on the vertices ATC, CCA, CAG, TCC, AGT with edge weights 1 and 2]
SSP: ATCCAGT
TSP: ATCCAGT (path ATC → TCC → CCA → CAG → AGT)
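The example can be checked with a small brute-force sketch that tries every ordering of the strings. This is for illustration only: it takes factorial time, assumes no string is a substring of another, and the names `overlap` and `shortest_superstring` are chosen here.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest prefix of b that matches a suffix of a."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring(strings):
    """Brute force: merge the strings in every order, keep the shortest result."""
    best = None
    for order in permutations(strings):
        merged = order[0]
        for prev, nxt in zip(order, order[1:]):
            # Append only the part of nxt not already covered by the overlap.
            merged += nxt[overlap(prev, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best
```

Each ordering corresponds to a path through the overlap graph that visits every vertex once, so the search mirrors the TSP formulation directly.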

Sequencing by Hybridization (SBH): History
1988: SBH suggested as an alternative sequencing method. Nobody believed it would ever work.
1991: Light-directed polymer synthesis developed by Steve Fodor and colleagues.
1994: Affymetrix develops the first 64-kb DNA microarray.
[Figure captions: first microarray prototype (1989); first commercial DNA microarray prototype with 16,000 features (1994); 500,000 features per chip (2002)]

How SBH Works
Attach all possible DNA probes of length l to a flat surface, each probe at a distinct and known location. This set of probes is called the DNA array.
Apply a solution containing a fluorescently labeled DNA fragment to the array.
The DNA fragment hybridizes with those probes that are complementary to substrings of length l of the fragment.

How SBH Works (cont’d)
Using a spectroscopic detector, determine which probes hybridize to the DNA fragment to obtain the l-mer composition of the target DNA fragment.
Apply a combinatorial algorithm to reconstruct the sequence of the target DNA fragment from the l-mer composition.

Hybridization on DNA Array

l-mer composition
Spectrum(s, l): unordered multiset of all possible (n – l + 1) l-mers in a string s of length n
The order of individual elements in Spectrum(s, l) does not matter
For s = TATGGTGC all of the following are equivalent representations of Spectrum(s, 3):
{TAT, ATG, TGG, GGT, GTG, TGC}
{ATG, GGT, GTG, TAT, TGC, TGG}
{TGG, TGC, TAT, GTG, GGT, ATG}
We usually choose the lexicographically ordered representation as the canonical one.

Different sequences – same spectrum
Different sequences may have the same spectrum:
Spectrum(GTATCT, 2) = Spectrum(GTCTAT, 2) = {AT, CT, GT, TA, TC}
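The spectrum definition and both examples can be reproduced with a short sketch. The function name `spectrum` is chosen here, and a sorted list stands in for the canonical lexicographic representation.

```python
def spectrum(s, l):
    """Return all (len(s) - l + 1) l-mers of s in lexicographic order."""
    return sorted(s[i:i + l] for i in range(len(s) - l + 1))
```

Sorting discards the original positions of the l-mers, which is exactly why two different sequences can end up with the same spectrum.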

The SBH Problem
Goal: Reconstruct a string from its l-mer composition
Input: A set S, representing all l-mers from an (unknown) string s
Output: String s such that Spectrum(s, l) = S

SBH: Hamiltonian Path Approach
S = { ATG AGG TGC TCC GTC GGT GCA CAG }
[Figure: graph H with one vertex per l-mer; the path ATG → TGC → GCA → CAG → AGG → GGT → GTC → TCC spells ATGCAGGTCC]
Path visited every VERTEX once

A more complicated graph:
S = { ATG TGG TGC GTG GGC GCA GCG CGT }

S = { ATG TGG TGC GTG GGC GCA GCG CGT }
Path 1: ATGCGTGGCA
Path 2: ATGGCGTGCA

SBH: Eulerian Path Approach
S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
Vertices correspond to (l – 1)-mers: { AT, TG, GC, GG, GT, CA, CG }
Edges correspond to l-mers from S
[Figure: graph on the vertices AT, TG, GC, GG, GT, CA, CG]
Path visited every EDGE once

SBH: Eulerian Path Approach
The graph on the vertices { AT, TG, GC, GG, GT, CA, CG } has two different Eulerian paths, which spell:
ATGGCGTGCA
ATGCGTGGCA
[Figure: the two paths drawn on the graph]

Euler Theorem
A graph is balanced if for every vertex the number of incoming edges equals the number of outgoing edges: in(v) = out(v)
Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.

Euler Theorem: Proof
Eulerian → balanced: for every edge entering v (incoming edge) there exists an edge leaving v (outgoing edge). Therefore in(v) = out(v)
Balanced → Eulerian: ???

Algorithm for Constructing an Eulerian Cycle
a. Start with an arbitrary vertex v and form an arbitrary cycle with unused edges until a dead end is reached. Since the graph is Eulerian, this dead end is necessarily the starting point, i.e., vertex v.
b. If the cycle from (a) is not an Eulerian cycle, it must contain a vertex w that has untraversed edges. Perform step (a) again, using vertex w as the starting point. Once again, we will end up at the starting vertex w.
c. Combine the cycles from (a) and (b) into a single cycle and iterate step (b).

Euler Theorem: Extension
Theorem: A connected graph has an Eulerian path if and only if it contains at most two semi-balanced vertices and all other vertices are balanced.
(A vertex v is semi-balanced if |in(v) – out(v)| = 1.)
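The Eulerian path approach to SBH can be sketched as follows. The code uses a stack-based variant of the cycle-merging idea (commonly known as Hierholzer's algorithm) rather than the literal steps (a) to (c); the function name `reconstruct_from_spectrum` is hypothetical, and the input is assumed to be the l-mers of a single string.

```python
from collections import defaultdict


def reconstruct_from_spectrum(lmers):
    """Spell a string whose l-mers are exactly `lmers`, via an Eulerian path."""
    graph = defaultdict(list)   # (l-1)-mer -> list of successor (l-1)-mers
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for lmer in lmers:
        prefix, suffix = lmer[:-1], lmer[1:]
        graph[prefix].append(suffix)   # one edge per l-mer
        outdeg[prefix] += 1
        indeg[suffix] += 1

    # Start at the semi-balanced vertex with one extra outgoing edge, if any.
    nodes = set(indeg) | set(outdeg)
    start = next((v for v in nodes if outdeg[v] - indeg[v] == 1), None)
    if start is None:
        start = lmers[0][:-1]

    # Walk along unused edges; when stuck, back up and record the vertex.
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(lmers) + 1:
        raise ValueError("no Eulerian path uses every l-mer")
    return path[0] + "".join(v[-1] for v in path[1:])
```

When several Eulerian paths exist, as in the eight 3-mer example, which one is returned depends on the order in which edges are stored.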

Some Difficulties with SBH
Fidelity of Hybridization: it is difficult to detect differences between probes hybridized with perfect matches and those with 1 or 2 mismatches
Array Size: The effect of low fidelity can be decreased with longer l-mers, but array size increases exponentially in l. Array size is limited with current technology.
Practicality: SBH is still impractical. As DNA microarray technology improves, SBH may become practical in the future
Practicality again: Although SBH is still impractical, it spearheaded expression analysis and SNP analysis techniques

Traditional DNA Sequencing
[Figure: DNA fragments + vector (circular genome: bacterium, plasmid) with a known location (restriction site)]
Shake

Different Types of Vectors
Vector: size of insert (bp)
Plasmid: 2, ,000
Cosmid: 40,000
BAC (Bacterial Artificial Chromosome): 70, ,000
YAC (Yeast Artificial Chromosome): > 300,000
Not used much recently

Shotgun Sequencing
Genomic segment is cut many times at random (shotgun)
Get one or two reads from each segment (~500 bp each)

Fragment Assembly
Cover the region with reads at ~7-fold redundancy
Overlap reads and extend to reconstruct the original genomic region

Read Coverage
Length of genomic segment: L
Number of reads: n
Length of each read: l
Coverage C = n l / L
How much coverage is enough?
Lander-Waterman model: Assuming uniform distribution of reads, C = 10 results in 1 gapped region per 1,000,000 nucleotides
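The coverage formula can be applied directly, as in this small sketch. The function names and the example numbers in the usage note are illustrative assumptions, not values from the slides.

```python
import math


def coverage(num_reads, read_length, segment_length):
    """Coverage C = n * l / L."""
    return num_reads * read_length / segment_length


def reads_needed(target_coverage, read_length, segment_length):
    """Smallest number of reads n such that n * l / L >= target coverage."""
    return math.ceil(target_coverage * segment_length / read_length)
```

For instance, with reads of 500 bp and a segment of 1,000,000 bp, `reads_needed(10, 500, 1_000_000)` asks how many reads give 10-fold coverage under these assumed numbers.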

Challenges in Fragment Assembly
Repeats: A major problem for fragment assembly
More than 50% of the human genome consists of repeats:
- over 1 million Alu repeats (about 300 bp)
- about 200,000 LINE repeats (1000 bp and longer)
[Figure: green and blue fragments on either side of a repeat]
Green and blue fragments are interchangeable when assembling repetitive DNA

Triazzle: A Fun Example
The puzzle looks simple BUT there are repeats!!!
The repeats make it very difficult.
Try it – available at

Repeat Types
Low-Complexity DNA (e.g. ATATATATACATA…)
Microsatellite repeats: $(a_1 \dots a_k)^N$ where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
Transposons/retrotransposons
SINE: Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, $10^6$ copies)
LINE: Long Interspersed Nuclear Elements (~ ,000 bp long, 200,000 copies)
LTR retroposons: Long Terminal Repeats (~700 bp) at each end
Gene Families: genes duplicate and then diverge
Segmental duplications: very long, very similar copies
