Bioinformatics Workshop, Fall 2003 Algorithms in Bioinformatics Lawrence D’Antonio Ramapo College of New Jersey.


1 Bioinformatics Workshop, Fall 2003 Algorithms in Bioinformatics Lawrence D’Antonio Ramapo College of New Jersey

2 Bioinformatics Workshop, Fall 2003 Topics Algorithm basics Types of algorithms in bioinformatics Sequence alignment Database Searches

3 Bioinformatics Workshop, Fall 2003 Algorithm basics What is an algorithm? Algorithm complexity P vs. NP NP completeness

4 Bioinformatics Workshop, Fall 2003 What is an algorithm? An algorithm is a step-by-step procedure to solve a problem. The word “algorithm” comes from the name of the 9th-century Islamic mathematician al-Khwarizmi.

5 Bioinformatics Workshop, Fall 2003 Algorithm Complexity If the algorithm works with n pieces of data and the number of steps is proportional to n, then we say that the running time is O(n). If the number of steps is proportional to log n, then the running time is O(log n).

6 Bioinformatics Workshop, Fall 2003 Example Problem: find the largest element in a sequence of n elements. Solution idea: Iteratively compare size of elements in sequence.

7 Bioinformatics Workshop, Fall 2003 Algorithm:
1. Initialize the first element as the largest.
2. For each remaining element: if the current element is larger than the largest, make that element the largest.
Running time: O(n)
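The two steps above can be sketched in a few lines of Python. The function name `find_largest` is an illustrative choice, not something from the slides.

```python
def find_largest(items):
    """Return the largest element of a non-empty sequence in O(n) time."""
    if not items:
        raise ValueError("sequence must be non-empty")
    largest = items[0]              # step 1: the first element is the initial candidate
    for current in items[1:]:       # step 2: one comparison per remaining element
        if current > largest:
            largest = current
    return largest


print(find_largest([3, 17, 8, 42, 5]))  # prints 42
```

Each element is examined exactly once, which is where the O(n) running time comes from.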

8 Bioinformatics Workshop, Fall 2003 Polynomial Time An algorithm is said to run in polynomial time if its running time can be written in the form O(n^k) for some power k. The underlying problem is said to be of class P.

9 Bioinformatics Workshop, Fall 2003 Polynomial Time Examples Searching: Binary Search, O(log n). Sorting: Quick Sort, O(n log n) on average (O(n^2) in the worst case).
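As a small illustration of the O(log n) search bound, here is a minimal binary search sketch. It assumes the input list is already sorted; the function name is hypothetical.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2            # probe the middle of the remaining range
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1                # discard the lower half
        else:
            hi = mid - 1                # discard the upper half
    return -1


print(binary_search([1, 3, 5, 7, 9, 11], 7))  # prints 3
```

The search range halves on every iteration, so at most about log2(n) probes are needed.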

10 Bioinformatics Workshop, Fall 2003 NP Algorithms An algorithm is nondeterministic if it begins with guessing a solution to the problem and then verifies the guess. A problem is of category NP if there is a nondeterministic algorithm for that problem which runs in polynomial time.

11 Bioinformatics Workshop, Fall 2003 NP Complete A problem is NP-complete if it has an NP algorithm and every other NP problem can be reduced to it in polynomial time, so that a solution to this problem can be used to solve all other NP problems. A problem is NP-hard if it is at least as hard as the NP-complete problems.

12 Bioinformatics Workshop, Fall 2003 NP Complete Examples Traveling salesman Knapsack problem Partition problem Graph coloring

13 Bioinformatics Workshop, Fall 2003 P = NP? P ⊆ NP. If P ≠ NP, then NP-complete problems have no polynomial-time algorithm; the best known algorithms for them take exponential time.

14 Bioinformatics Workshop, Fall 2003 Polynomial vs. Exponential
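To make the contrast concrete, the following sketch tabulates a polynomial (n^2) against an exponential (2^n) for a few input sizes. The choice of n^2 and 2^n and the sample sizes are illustrative assumptions.

```python
def growth_table(sizes):
    """Return (n, n^2, 2^n) for each n, to compare the two growth rates."""
    return [(n, n ** 2, 2 ** n) for n in sizes]


for n, poly, expo in growth_table([10, 20, 40, 80]):
    print(f"n={n:<3} n^2={poly:<6} 2^n={expo}")
```

Doubling n quadruples n^2 but squares 2^n, which is why exponential algorithms become impractical so quickly.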

15 Bioinformatics Workshop, Fall 2003 Algorithms in Bioinformatics Algorithms to compare DNA, RNA, or protein sequences Database searches to find homologous sequences Sequence assembly Construction of evolutionary trees Structure prediction

16 Bioinformatics Workshop, Fall 2003 Edit operations on sequences
Substitution: AATAAGC -> ATTAAGC
Insertion: AAT-AAGC -> AATTAAGC
Deletion: AATAAGC -> AA-AAGC

17 Bioinformatics Workshop, Fall 2003 What is sequence alignment? Compare two sequences using matches, substitutions and indels.
G A A - - T C A T
G - T G G - C A -
3 matches, 1 substitution, 5 indels

18 Bioinformatics Workshop, Fall 2003 Complexity of DNA Problems 3 billion base pairs in human genome. Many NP-complete problems. 10^600 possible alignments for two 1000-character sequences.
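The order of magnitude of the alignment count can be checked numerically. The sketch below assumes that the figure corresponds to the binomial coefficient C(2n, n) for two sequences of length n, which is one common way of counting alignments; the slide itself does not say how the number was derived.

```python
from math import comb


def count_alignments(n):
    """Assumed count of alignments of two length-n sequences: C(2n, n)."""
    return comb(2 * n, n)


digits = len(str(count_alignments(1000)))
print(digits)  # prints 601, i.e. the count is on the order of 10^600
```

Under this assumption, exhaustive enumeration of alignments is clearly out of reach, which motivates the dynamic programming approach.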

19 Bioinformatics Workshop, Fall 2003 Types of sequence alignment Determine the alignment of two sequences that maximizes similarity (global alignment) Determine substrings of two sequences with maximum similarity (local alignment) Determine the alignment for several sequences that maximizes the sum of pairs similarity (multiple alignment)

20 Bioinformatics Workshop, Fall 2003 Significance of Alignment Functional similarity Structural similarity Homology

21 Bioinformatics Workshop, Fall 2003 Scoring System Assign a score for each possible match, substitution and indel Distance functions – Find alignment to minimize distance between sequences Similarity functions – Find alignment to maximize similarity between sequences

22 Bioinformatics Workshop, Fall 2003 Edit Distance
G A A - - T C A T
G - T G G - C A -
Similarity function: 1 for match, -1 for substitution, -2 for indel
Score: -8
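The score of -8 can be reproduced column by column. This is a minimal sketch using the slide's similarity function; the function name and the use of "-" for a space are illustrative choices.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, indel=-2):
    """Sum the per-column scores of two equal-length aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += indel          # a space opposite a character
        elif x == y:
            score += match
        else:
            score += mismatch       # substitution
    return score


print(score_alignment("GAA--TCAT", "G-TGG-CA-"))  # prints -8
```

The total is 3 matches (+3), 1 substitution (-1) and 5 indels (-10).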

23 Bioinformatics Workshop, Fall 2003 Dynamic Programming Used on optimization problems Bottom-up approach Recursively builds up solution from subproblem optimal solutions

24 Bioinformatics Workshop, Fall 2003 Dynamic Programming Alignment Algorithm (Needleman-Wunsch) Given sequences a_1, a_2, …, a_n and b_1, b_2, …, b_m to be aligned: initialize the alignment matrix (aligning with spaces). Entry [i, j] gives the optimal alignment score for the sequences a_1, a_2, …, a_i and b_1, b_2, …, b_j (where 1 ≤ i ≤ n, 1 ≤ j ≤ m).

25 Bioinformatics Workshop, Fall 2003 Computing Alignment Matrix If a_1, a_2, …, a_i and b_1, b_2, …, b_j have been aligned, there are three possible next moves: match a_{i+1} with b_{j+1}; match a_{i+1} with a space; match b_{j+1} with a space. Choose the move that maximizes the similarity of the two sequences.
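The three-move rule translates directly into a matrix fill. The sketch below assumes the similarity function from the Edit Distance slide (1 for a match, -1 for a substitution, -2 for an indel); the function name is hypothetical.

```python
def alignment_matrix(a, b, match=1, mismatch=-1, indel=-2):
    """Fill the global alignment matrix; D[i][j] scores a[:i] against b[:j]."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * indel                 # a[:i] aligned entirely with spaces
    for j in range(1, m + 1):
        D[0][j] = j * indel                 # b[:j] aligned entirely with spaces
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(
                D[i - 1][j - 1] + pair,     # match a_i with b_j
                D[i - 1][j] + indel,        # match a_i with a space
                D[i][j - 1] + indel,        # match b_j with a space
            )
    return D


D = alignment_matrix("GGGCAT", "GGACA")
print(D[-1][-1])  # prints 1, the optimal global alignment score
```

The bottom-right entry is the score of the best alignment of the two complete sequences.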

26 Bioinformatics Workshop, Fall 2003 Global Alignment Matrix
        -    G    G    A    C    A
   -    0   -2   -4   -6   -8  -10
   G   -2    1   -1   -3   -5   -7
   G   -4   -1    2    0   -2   -4
   G   -6   -3    0    1   -1   -3
   C   -8   -5   -2   -1    2    0
   A  -10   -7   -4   -1    0    3
   T  -12   -9   -6   -3   -2    1

27 Bioinformatics Workshop, Fall 2003 Optimal Global Alignment
G G G C A T
G G A C A -
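An optimal alignment can be read back from the matrix by retracing which move produced each entry. This is a sketch under the same assumed scoring (1, -1, -2); the tie-breaking order (diagonal first, then a space in the second sequence) is an arbitrary choice, and other optimal alignments may exist.

```python
def global_align(a, b, match=1, mismatch=-1, indel=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * indel
    for j in range(1, m + 1):
        D[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + pair,
                          D[i - 1][j] + indel,
                          D[i][j - 1] + indel)

    # Trace back from the bottom-right corner to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if D[i][j] == D[i - 1][j - 1] + pair:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(global_align("GGGCAT", "GGACA"))  # prints (1, 'GGGCAT', 'GGACA-')
```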

28 Bioinformatics Workshop, Fall 2003 Alignment Running Time Assuming two sequences of n characters each, the running time is O(n^2) (each entry of the matrix must be calculated).

29 Bioinformatics Workshop, Fall 2003 Variations of Alignment Algorithm Gap penalty Local alignment Multiple alignment

30 Bioinformatics Workshop, Fall 2003 Gap Penalty A gap is a number k of consecutive spaces. k consecutive spaces are more probable than k isolated spaces. Typical gap penalty function: a + b·k (affine gap penalty). Here the first space in a gap is penalized a + b; further spaces are penalized b each.

31 Bioinformatics Workshop, Fall 2003 Gap Penalty Example Use penalty 1 + k for a gap of k spaces.
A - A - C - A
A C T A T C A
Score: -6
A A C - - - A
A C T A T C A
Score: -4
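Both scores can be reproduced with a short scoring routine. The sketch assumes 1 for a match and -1 for a substitution (the slide does not restate them) and an affine penalty a + b·k with a = b = 1; parameter names are illustrative.

```python
def score_with_affine_gaps(row_a, row_b, match=1, mismatch=-1,
                           gap_open=1, gap_extend=1):
    """Score two aligned rows, charging gap_open + gap_extend * k per gap of k spaces."""
    score = 0
    gap_row = None                      # which row the current run of spaces is in
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            current = "a" if x == "-" else "b"
            if current != gap_row:
                score -= gap_open       # a new gap starts here
            score -= gap_extend         # every space in the gap costs gap_extend
            gap_row = current
        else:
            score += match if x == y else mismatch
            gap_row = None
    return score


print(score_with_affine_gaps("A-A-C-A", "ACTATCA"))  # prints -6
print(score_with_affine_gaps("AAC---A", "ACTATCA"))  # prints -4
```

Grouping the three spaces into one gap pays the opening cost once instead of three times, so the second alignment scores higher.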

32 Bioinformatics Workshop, Fall 2003 Local Alignment Find conserved regions in otherwise dissimilar sequences (e.g., viral and host DNA) Smith-Waterman algorithm Includes a fourth possibility at each step (don’t align)

33 Bioinformatics Workshop, Fall 2003 Local Alignment Example Align the following:
G C T C T G C G A A T A
C G T T G A G A T A C T

34 Bioinformatics Workshop, Fall 2003 Optimal Local Alignment
G C T C T G C G A A T A
C G T T G A G A T A C T

(G C T C) T G C G A A T A
  (C G T) T G A G - A T A (C T)
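A minimal Smith-Waterman sketch shows the "fourth possibility": each matrix entry may also be reset to 0, which corresponds to not aligning anything before that point. The scoring values (1, -1, -2) are assumed from the earlier similarity function, so the alignment it reports for the slide's sequences may differ from the one shown if the original example used other scores.

```python
def local_align(a, b, match=1, mismatch=-1, indel=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                        # fourth option: don't align
                          H[i - 1][j - 1] + pair,
                          H[i - 1][j] + indel,
                          H[i][j - 1] + indel)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero entry is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("TTACGTT", "GGACGGG"))        # prints (3, 'ACG', 'ACG')
print(local_align("GCTCTGCGAATA", "CGTTGAGATACT"))
```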

35 Bioinformatics Workshop, Fall 2003 Multiple Alignment Find the alignment among a set of sequences that maximizes the sum of scores for all pairs of sequences. Dynamic programming run-time for k sequences of length n: O(k^2 2^k n^k). Multiple alignment is NP-complete.

36 Bioinformatics Workshop, Fall 2003 Other Features Usually used for protein alignment Can be used for global or local alignment

37 Bioinformatics Workshop, Fall 2003 Multiple Alignment Example
PEAALYGRFT---IKSDVW
PESLAYNKF---SIKSDVW
PEALNYGRY---SSESDVW
PEALNYGWY---SSESDVW
PEVIRMQDDNPFSFQSDVY
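The sum-of-pairs objective can be evaluated for any given multiple alignment by scoring every pair of rows column by column. The sketch below uses a simple assumed scheme (1 for a match, -1 for a substitution, -2 for a character opposite a space, 0 for two spaces); real protein alignments would normally use a substitution matrix instead.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, indel=-2):
    """Score one pair of symbols from the same alignment column."""
    if x == "-" and y == "-":
        return 0                        # two spaces: no cost (an assumption)
    if x == "-" or y == "-":
        return indel
    return match if x == y else mismatch


def sum_of_pairs(rows):
    """Sum pair_score over all pairs of rows in every column."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows must have the same length")
    return sum(pair_score(x, y)
               for column in zip(*rows)
               for x, y in combinations(column, 2))


print(sum_of_pairs(["AC-", "A-T", "ACT"]))  # prints -3
print(sum_of_pairs([
    "PEAALYGRFT---IKSDVW",
    "PESLAYNKF---SIKSDVW",
    "PEALNYGRY---SSESDVW",
    "PEALNYGWY---SSESDVW",
    "PEVIRMQDDNPFSFQSDVY",
]))
```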

38 Bioinformatics Workshop, Fall 2003 Multiple vs. Pairwise Alignment Optimal multiple alignment does not imply optimal pairwise alignment ATA - A - - T - T

39 Bioinformatics Workshop, Fall 2003 Substitution Matrices In homologous sequences certain amino acid substitutions are more likely to occur than others Types of substitution matrices *PAM *BLOSUM

40 Bioinformatics Workshop, Fall 2003 PAM Matrices Defines units of evolutionary distance. 1 PAM unit represents an average of one mutation per 100 amino acids. Start with a set of highly similar sequences and compute:
* p_a = probability of occurrence of amino acid a
* M_ab = probability of a mutating to b

41 Bioinformatics Workshop, Fall 2003 PAM Matrix Formula Entries in a k-PAM matrix
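The formula on this slide did not survive in the transcript. One form commonly given in textbooks, written with the p_a and M_ab defined on the previous slide, is the log-odds score below; the scale factor 10 and the base-10 logarithm are assumptions here, not something the transcript confirms.

```latex
\mathrm{PAM}_k(a, b) \;=\; 10 \,\log_{10} \frac{\left(M^{k}\right)_{ab}}{p_b}
```

Here M^k is the k-th power of the 1-PAM mutation matrix, so (M^k)_{ab} is the probability that a has become b after k PAM units, and dividing by p_b compares that with meeting b by chance.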

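42 Bioinformatics Workshop, Fall 2003 PAM250 Matrix
     C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   12
S    0   2
T   -2   1   3
P   -3   1   0   6
A   -2   1   1   1   2
G   -3   1   0  -1   1   5
N   -4   1   0  -1   0   0   2
D   -5   0   0  -1   0   1   2   4
E   -5   0   0  -1   0   0   1   3   4
Q   -5  -1  -1   0   0  -1   1   2   2   4
H   -3  -1  -1   0  -1  -2   2   1   1   3   6
R   -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
K   -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M   -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I   -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L   -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V   -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F   -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y    0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W   -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17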

43 Bioinformatics Workshop, Fall 2003 BLOSUM Matrices (Omit) Uses log-odds ratio similar to PAM Uses short highly conserved sequences BLOSUM x matrices created after removing sequences that are more than x percent identical Better at local alignments

44 Bioinformatics Workshop, Fall 2003 BLOSUM Matrices A motif is a conserved amino acid pattern found in a group of proteins with similar biological meaning (PROSITE) A block is a conserved amino acid pattern in a group of proteins (no spaces allowed in the pattern) (BLOCKS)

45 Bioinformatics Workshop, Fall 2003 Motif Example Motif obtained from a group of 34 tubulin proteins M[FYW].. F[VLI]H. [FYW].. EGM
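A motif written this way maps naturally onto a regular expression: a bracket lists the residues allowed at a position and a dot allows any residue. The sketch below assumes the spaces in the motif as shown are layout artifacts rather than part of the pattern; the test protein string is made up for illustration.

```python
import re

# Tubulin motif from the slide, with the spaces removed (an assumption).
MOTIF = re.compile(r"M[FYW]..F[VLI]H.[FYW]..EGM")


def find_motif(protein):
    """Return the start positions of non-overlapping motif occurrences."""
    return [m.start() for m in MOTIF.finditer(protein)]


print(find_motif("AAMFAAFVHAFAAEGMKK"))  # prints [2]
```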

46 Bioinformatics Workshop, Fall 2003 Defining BLOSUM (I) BLOSUMn uses blocks that are n% identical (BLOSUM62 is most common) Consider all pairs of amino acids appearing in the same column in the blocks

47 Bioinformatics Workshop, Fall 2003 Defining BLOSUM (II) Define n(i,j) to be the frequency that amino acids i,j appear in a column pair Define e(i,j) to be the frequency that amino acids i,j appear in any pair Define BLOSUM entry
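The entry formula itself is missing from the transcript. With n(i,j) and e(i,j) as defined above, a log-odds form along the following lines is typical; the factor 2 and the base-2 logarithm (half-bit units, rounded to the nearest integer) are assumptions about the scaling.

```latex
\mathrm{BLOSUM}(i, j) \;=\; \operatorname{round}\!\left( 2 \,\log_{2} \frac{n(i,j)}{e(i,j)} \right)
```

A positive entry means the pair i, j appears in aligned columns more often than expected by chance; a negative entry means it appears less often.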

48 Bioinformatics Workshop, Fall 2003 PAM vs. BLOSUM PAM derived from highly similar sequences (evolutionary model) BLOSUM derived from protein families sharing a common ancestor (conserved domain model)

49 Bioinformatics Workshop, Fall 2003 Database Searches FASTA BLAST

50 Bioinformatics Workshop, Fall 2003 FASTA Looks for sequences in a database similar to a query sequence Heuristic, exclusion method Compares query sequence to each database sequence (called the text)

51 Bioinformatics Workshop, Fall 2003 FASTA Algorithm (I) Look for small substrings in query and text that exactly match (“hot spots”) Find ten best “diagonal runs” of hot spots

52 Bioinformatics Workshop, Fall 2003 Hot Spot Example E K L A S R K L H A * S * H K * L *
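Hot spots and their diagonals can be found with a simple lookup table of substrings. The sketch below assumes hot spots are exact matches of length k = 2 and that the example compares the query EKLASRKL with the text HASHKL; both are readings of the slide rather than stated facts.

```python
from collections import defaultdict


def hot_spots_by_diagonal(query, text, k=2):
    """Group exact k-letter matches (i in query, j in text) by diagonal i - j."""
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)     # index every k-mer of the query
    diagonals = defaultdict(list)
    for j in range(len(text) - k + 1):
        for i in positions.get(text[j:j + k], []):
            diagonals[i - j].append((i, j))     # same offset means same diagonal
    return dict(diagonals)


print(hot_spots_by_diagonal("EKLASRKL", "HASHKL"))
# prints {2: [(3, 1), (6, 4)], -3: [(1, 4)]}
```

Diagonal 2 collects two hot spots (AS and KL), which makes it the best candidate for a diagonal run in this small example.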

53 Bioinformatics Workshop, Fall 2003 FASTA Algorithm (II) Find best local alignment for each run Combine these into larger alignment Do multiple alignment on query and texts having highest score in last step

54 Bioinformatics Workshop, Fall 2003 BLAST Basic Local Alignment Search Tool Heuristic, exclusion method Computes statistical significance of alignment scores

55 Bioinformatics Workshop, Fall 2003 BLAST Algorithm Find all w-length substrings in text that align to some w-length substring in query with score above a given threshold (called “hits”) Extend these hits as far as possible (“segment pairs”) Report the highest scoring segment pairs
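The hit-and-extend idea can be sketched in a simplified form. This is not the actual BLAST implementation: it treats only exact w-letter matches as hits instead of scoring words against a threshold, extends without gaps to the ends of the sequences while keeping the best-scoring extent, and uses assumed scores of 1 and -1. All names are hypothetical.

```python
def extend_hit(query, text, i, j, w, match=1, mismatch=-1):
    """Extend the exact hit query[i:i+w] == text[j:j+w] without gaps.

    Returns (score, query_start, text_start, length) of the best segment pair.
    """
    score = best = w * match
    best_right = step = 0
    while i + w + step < len(query) and j + w + step < len(text):
        score += match if query[i + w + step] == text[j + w + step] else mismatch
        step += 1
        if score > best:
            best, best_right = score, step      # remember the best right extent
    score = best
    best_left = step = 0
    while i - step - 1 >= 0 and j - step - 1 >= 0:
        score += match if query[i - step - 1] == text[j - step - 1] else mismatch
        step += 1
        if score > best:
            best, best_left = score, step       # remember the best left extent
    return best, i - best_left, j - best_left, w + best_left + best_right


def blast_like(query, text, w=3):
    """Return distinct segment pairs, highest score first."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    pairs = set()
    for j in range(len(text) - w + 1):
        for i in words.get(text[j:j + w], []):
            pairs.add(extend_hit(query, text, i, j, w))
    return sorted(pairs, reverse=True)


print(blast_like("CCACGTGG", "TTACGTAA"))  # prints [(4, 2, 2, 4)]
```

The two overlapping hits ACG and CGT extend to the same segment pair ACGT, so only one result is reported.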

56 Bioinformatics Workshop, Fall 2003 Other Bioinformatics Algorithms Palindromes Tandem Repeats Longest Common Subsequence Double Digest (NP complete) Shortest Common Superstring (NP complete)

57 Bioinformatics Workshop, Fall 2003 References
Clote and Backofen, Computational Molecular Biology, Wiley
Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press
Mount, Bioinformatics, Cold Spring Harbor Press
Setubal and Meidanis, Introduction to Computational Molecular Biology, PWS
Waterman, Introduction to Computational Biology, CRC Press

