Computer Science Background for Biologists CSC 487/687 Computing for Bioinformatics Fall 2005.


What is an algorithm?
- A well-defined computational procedure that takes some values as input and produces some value as output.
- We are interested in the correctness and efficiency of computer algorithms.
- We seek to extract clean, well-defined problems from the typically messy "real" problem to gain insight into it.

Example of an algorithm (the sorting problem)
- Input: a sequence of n numbers (a_1, a_2, …, a_n).
- Output: a permutation (a'_1, a'_2, …, a'_n) of the input sequence such that a'_1 ≤ a'_2 ≤ … ≤ a'_n.
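The input/output specification above can be checked mechanically. The following is a minimal sketch in Python; the helper name `is_sorted_permutation` and the sample numbers are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def is_sorted_permutation(original, result):
    """Return True if `result` is a non-decreasing rearrangement of `original`."""
    # Same multiset of elements means `result` is a permutation of `original`.
    same_elements = Counter(original) == Counter(result)
    # Each element must be <= its successor.
    non_decreasing = all(result[i] <= result[i + 1] for i in range(len(result) - 1))
    return same_elements and non_decreasing

numbers = [31, 41, 59, 26, 41, 58]  # hypothetical sample input
print(is_sorted_permutation(numbers, sorted(numbers)))  # True
```

Separating "what counts as a correct output" from "how to compute it" is what makes the problem well defined.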

Exact String Matching
- Input: a text string T, where |T| = n, and a pattern string P, where |P| = m.
- Output: an index i such that T[i+k-1] = P[k] for all 1 ≤ k ≤ m, i.e. showing that P is a substring of T.
- Example: text T = abcabaabcabac, pattern P = abaa.

Exact String Matching
- Brute force search algorithm (the bound j <= m is tested first, so that P[m+1] is never read):

    for i = 1 to n-m+1 do
        j = 1;
        while (j <= m) and (T[i+j-1] == P[j]) do
            j = j+1;
        if (j > m) then
            print "pattern at position ", i;
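The brute force search can be written as runnable code. This is a sketch in Python, assuming 0-based indexing (so reported positions are one less than in the 1-based pseudocode); the function name `brute_force_search` is an illustrative choice.

```python
def brute_force_search(text, pattern):
    """Return every 0-based index at which `pattern` occurs in `text`."""
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):
        j = 0
        # Check the bound first so pattern[j] is never read out of range.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:  # every character of the pattern matched
            positions.append(i)
    return positions

print(brute_force_search("abcabaabcabac", "abaa"))  # [3]
```

With the example strings from the previous slide, the single match at 0-based index 3 corresponds to position 4 in 1-based numbering.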

Algorithm Efficiency
- Time efficiency of algorithms
- Space efficiency of algorithms

Machine Independent Analysis
- We assume that every basic operation takes constant time.
- Example basic operations: addition, subtraction, multiplication, memory access.
- The time efficiency of an algorithm is the number of basic operations it performs.
- We do not distinguish between the basic operations.

Time efficiency
- In fact, we will not worry about the exact values, but will look at "broad classes" of values.
- Let there be n inputs. If one algorithm needs n basic operations and another needs 2n basic operations, we will consider them to be in the same efficiency category.
- However, we distinguish between exp(n), n, and log(n).

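Example: Time Complexity
- The brute force search might use only about n steps if we are lucky (every alignment fails on its first character).
- We might need about n*m steps if we are unlucky (every alignment fails only near the end of the pattern).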
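The gap between the lucky and unlucky cases can be made concrete by counting character comparisons. The sketch below is illustrative only: the function `count_comparisons` and the test strings are assumptions chosen to trigger each extreme.

```python
def count_comparisons(text, pattern):
    """Count character comparisons made by brute force matching."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break  # mismatch: move on to the next alignment
    return comparisons

n, m = 1000, 10
# Lucky: the first character never matches, so one comparison per alignment.
lucky = count_comparisons("b" * n, "a" * m)
# Unlucky: only the last pattern character mismatches, so m comparisons each.
unlucky = count_comparisons("a" * n, "a" * (m - 1) + "b")
print(lucky, unlucky)  # 991 9910
```

Here the lucky case costs n-m+1 comparisons and the unlucky case costs (n-m+1)*m, which is where the "about n" and "about n*m" estimates come from.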

Order of Increase
- We care about how quickly the running time of our algorithms grows as the input size increases.
- Typical growth functions: n, log n, exp(n).
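To get a feel for how differently these three functions grow, one can simply tabulate them. A small sketch in Python (the helper `growth` and the chosen values of n are illustrative assumptions):

```python
import math

def growth(n):
    """Return (log n, n, exp(n)) for a positive input size n."""
    return math.log(n), n, math.exp(n)

for n in (1, 10, 20, 50):
    log_n, linear, exp_n = growth(n)
    print(f"n={linear:>3}  log(n)={log_n:6.2f}  exp(n)={exp_n:.3e}")
```

Even at n = 50 the exponential column is astronomically larger than the other two, while the logarithm has barely moved.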

Function Orders
- A function f(n) is O(g(n)) if the "increase" of f(n) is not faster than that of g(n).
- Formally, f(n) is O(g(n)) if there exist a number n0 and a positive constant c such that for all n ≥ n0, 0 ≤ f(n) ≤ c·g(n).
- If the limit of f(n)/g(n) as n → ∞ exists and is finite, then f(n) is O(g(n)).
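As a worked instance of the definition, take the illustrative functions f(n) = 3n + 5 and g(n) = n (these are chosen here for the example and do not come from the slides):

```latex
\begin{align*}
  f(n) = 3n + 5 &\le 3n + n = 4n && \text{for all } n \ge 5, \\
  \text{so } 0 \le f(n) &\le c \cdot g(n) && \text{with } c = 4,\ n_0 = 5,\ g(n) = n. \\
  \text{Equivalently, } \lim_{n \to \infty} \frac{f(n)}{g(n)}
    &= \lim_{n \to \infty} \left( 3 + \frac{5}{n} \right) = 3 < \infty.
\end{align*}
```

Either route shows that 3n + 5 is O(n); the constants c and n0 are not unique, and any valid pair is enough.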

Implication of Big-Oh notation
- Big-Oh notation gives an upper bound on the number of steps that an algorithm takes in the worst case.
- Suppose we know that our algorithm uses O(f(n)) basic steps for any n inputs. Then, for sufficiently large n, it will terminate after executing at most a constant times f(n) basic steps.
- We know that a basic step takes constant time on a machine. Hence our algorithm will terminate within a constant times f(n) units of time, for all large n.

Algorithm Complexity
- Thus the brute force string matching algorithm is O(mn), which is quadratic when m is proportional to n.
- A quadratic-time algorithm is usually fast enough for small problems, but not for big ones.
- An exponential-time algorithm can only be fast enough for tiny problems.

Any improvement on brute force search?
- Some of these comparisons are wasted work!
- By being more clever, we can reduce the worst-case running time to O(n+m).
- Knuth-Morris-Pratt string matching
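The slides name Knuth-Morris-Pratt without giving details. The following is a sketch of one common formulation in Python, not necessarily the version presented in the course. It assumes 0-based indexing, and the names `build_failure_table` and `kmp_search` are illustrative.

```python
def build_failure_table(pattern):
    """failure[j] is the length of the longest proper prefix of
    pattern[:j+1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = failure[k - 1]  # fall back to a shorter matching prefix
        if pattern[j] == pattern[k]:
            k += 1
        failure[j] = k
    return failure

def kmp_search(text, pattern):
    """Return every 0-based index at which `pattern` occurs in `text`."""
    if not pattern:
        return list(range(len(text) + 1))
    failure = build_failure_table(pattern)
    positions = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # reuse earlier matches instead of restarting
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return positions

print(kmp_search("abcabaabcabac", "abaa"))  # [3]
```

The text index `i` never moves backwards, which is where the O(n+m) bound comes from: O(m) to build the table and O(n) to scan the text.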

NP, NP-hard, NP-complete Problems
- A problem is in the class NP if a proposed solution can be verified in polynomial time.
- A problem is NP-hard if an algorithm for solving it can be translated, in polynomial time, into one for solving any other NP problem.
- NP-hard therefore means "at least as hard as any NP problem".
- NP-complete: a problem that is both in NP and NP-hard.

NP-Completeness
- Unfortunately, for many problems, there is no known polynomial-time algorithm.
- Even worse, many of these problems can be proven NP-complete, meaning that no polynomial-time algorithm exists unless P = NP.
- In practice we turn to heuristics and approximation algorithms.

Shortest Common Superstring
- Input: a set S = {s_1, s_2, …, s_m} of text strings over some alphabet Σ.
- Output: the shortest possible string T such that each s_i is a substring of T.
- This application arises in DNA sequencing.

Shortest common superstring
- The problem is NP-complete (in its decision form: is there a common superstring of length at most L?).
- Can you suggest an algorithm to find the shortest common superstring?
- Greedy heuristic: approximates the optimal solution.
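Although finding the shortest superstring is hard, checking a proposed answer is easy, and this is what membership in NP requires. Here is a minimal verifier sketch in Python; the function name `verify_superstring`, its `max_length` parameter, and the sample fragments are all illustrative assumptions.

```python
def verify_superstring(candidate, strings, max_length):
    """Check a certificate for the decision problem: is `candidate` a common
    superstring of `strings` with length at most `max_length`?"""
    if len(candidate) > max_length:
        return False
    # Each substring test takes polynomial time, so the whole check does too.
    return all(s in candidate for s in strings)

fragments = ["abc", "bcd", "cde"]  # hypothetical input strings
print(verify_superstring("abcde", fragments, 5))   # True
print(verify_superstring("abcd", fragments, 5))    # False: "cde" is missing
print(verify_superstring("abcbcdcde", fragments, 5))  # False: too long
```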

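Greedy Heuristic
- We always merge the two strings with the longest overlap.
- Put the combined string back.
- Repeat until only one string remains.
- GREEDY is conjectured to find a superstring of length at most twice optimal; the proven guarantees are weaker (at most 4 times optimal, later improved to 3.5).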
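The three steps of the heuristic translate fairly directly into code. The sketch below is a naive Python version under stated assumptions: duplicates and strings contained in another string are dropped up front, ties between equal overlaps are broken by scan order, and the names `overlap` and `greedy_superstring` are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_superstring(strings):
    """Repeatedly merge the pair of strings with the longest overlap."""
    unique = list(dict.fromkeys(strings))  # drop duplicates, keep order
    # Assumption: a string contained in another adds nothing, so drop it.
    items = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(items) > 1:
        best_len, best_i, best_j = -1, None, None
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    length = overlap(a, b)
                    if length > best_len:
                        best_len, best_i, best_j = length, i, j
        # Merge the best pair and put the combined string back.
        merged = items[best_i] + items[best_j][best_len:]
        items = [s for idx, s in enumerate(items) if idx not in (best_i, best_j)]
        items.append(merged)
    return items[0] if items else ""

print(greedy_superstring(["abc", "bcd", "cde"]))  # abcde
```

The result is always a common superstring of the input, but it is not guaranteed to be the shortest one.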

Time complexity of the greedy heuristic
- We assume n strings, each of length k.
- There are about n rounds, since each merge reduces the number of strings by one.
- Each round makes O(n^2) string comparisons.
- Each string comparison takes O(k^2) steps.
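Multiplying the three factors gives a rough overall bound for a naive implementation. This estimate assumes that all pairs are compared afresh in every round and that each comparison costs O(k^2), as on the slide; merged strings actually grow longer than k, so it should be read as a back-of-the-envelope figure.

```latex
\[
  \underbrace{O(n)}_{\text{rounds}}
  \times
  \underbrace{O(n^2)}_{\text{comparisons per round}}
  \times
  \underbrace{O(k^2)}_{\text{steps per comparison}}
  \;=\; O(n^3 k^2)
\]
```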
