
Approximate Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.


1 Approximate Alignment Vasileios Hatzivassiloglou University of Texas at Dallas

2 Midterm Focus on understanding, not memorization Three types of questions –Direct test of your knowledge from the class/slides (30%) –Thinking questions (40%) –Problems (30%) Date: Tuesday, October 14

3 Sample direct question What is mRNA? Name two processes where it plays a role. [2 sentences] Answer: mRNA is a complementary copy of the DNA in a gene. It is involved in transcription (from DNA to mRNA) and translation (from mRNA to proteins via tRNA).

4 Sample thinking question A biologist has discovered a method that quickly reports the total number of each type of nucleotide in the DNA inside the nucleus of a cell. Using this method, he reports that for Rattus norvegicus, the distribution is 22% Adenine, 29% Cytosine, 18% Guanine, and 31% Thymine. Explain why his results cannot possibly be correct. [1-2 sentences]

5 Answer Since the analysis is supposedly performed over the entire nucleus, it should include both complementary strands of each chromosome. Because A always pairs with T and C with G, the number (and percentage) of A and T nucleotides should be the same, and likewise for C and G.
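The argument can be checked on a toy example. The sketch below is only an illustration: the helper name `double_strand_counts` and the short example strand are made up here, not taken from the slides. It counts nucleotides over a strand together with its reverse complement and compares the result with the reported percentages.

```python
from collections import Counter

# Watson-Crick pairing: A <-> T, C <-> G
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def double_strand_counts(strand: str) -> Counter:
    """Count nucleotides over a strand plus its reverse complement."""
    other_strand = strand.translate(COMPLEMENT)[::-1]
    return Counter(strand) + Counter(other_strand)


# Hypothetical strand, chosen only for illustration.
counts = double_strand_counts("AACCCGTTTTAC")

# Percentages reported in the sample question.
reported = {"A": 22, "C": 29, "G": 18, "T": 31}
reported_is_consistent = (
    reported["A"] == reported["T"] and reported["C"] == reported["G"]
)

print(counts["A"] == counts["T"], counts["C"] == counts["G"])
print(reported_is_consistent)
```

Whatever the single-strand composition, the double-stranded counts of A and T (and of C and G) come out equal, which the reported figures violate.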

6 Sample problem question The Longest Common Subsequence (LCS) problem is as follows: Given strings S and T, find the longest string R that is a subsequence of both S and T. Which algorithm among the ones we discussed in class can be applied without modifications to solve this problem? State the appropriate parameters for that algorithm so that it works for the LCS problem. [2-5 sentences]

7 Answer The LCS problem is the same as local alignment if we never allow a mismatch, score all matches with the same positive number, and do not penalize indels. Then the DP algorithm discussed in class will find the longest (because matches improve the score) part of each string that can be matched exactly while freely deleting characters from each string, i.e., the LCS. The corresponding parameters are σ(x,x) = 1 (any positive number will work), σ(x,y) = -∞ for different non-space x and y, and σ(x,-) = σ(-,x) = 0.
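A minimal sketch of this answer, assuming the usual local-alignment recurrence (maximum of 0, diagonal, up, and left moves). The function names `sigma` and `lcs_by_alignment` are chosen here for illustration and are not from the course material.

```python
import math


def sigma(x: str, y: str) -> float:
    """Scoring function from the answer: match 1, mismatch -inf, indel 0."""
    if x == "-" or y == "-":
        return 0
    return 1 if x == y else -math.inf


def lcs_by_alignment(s: str, t: str) -> str:
    """Fill the local-alignment DP table with the LCS parameters, then trace back."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),  # match
                score[i - 1][j] + sigma(s[i - 1], "-"),           # delete from s
                score[i][j - 1] + sigma("-", t[j - 1]),           # delete from t
            )
    # Indels are free, so scores never decrease and the best cell is (n, m).
    i, j = n, m
    matched = []
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1] and score[i][j] == score[i - 1][j - 1] + 1:
            matched.append(s[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(matched))


print(lcs_by_alignment("AGGTAB", "GXTXAYB"))
```

Because gaps cost nothing, the final score equals the number of matched characters, which is the length of the LCS.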

8 How high is O(nm)? Suppose we are matching a given protein (300 amino acids) against SwissProt Current SwissProt stats (September 2008) –397,500 entries –143 million amino acids (360 amino acids on average) Need 397,500 × 360 × 300 (≈42.9 billion) time units and 360 × 300 (=108,000) space units (all multiplied by a constant)
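The estimate can be reproduced with a few lines of arithmetic. The variable names below are chosen for illustration; the numbers are the ones on the slide, and the space figure assumes one DP table is held at a time.

```python
entries = 397_500      # SwissProt entries (September 2008)
avg_entry_len = 360    # average amino acids per entry
query_len = 300        # length of the query protein

# One O(nm) table per database entry.
time_units = entries * avg_entry_len * query_len
# Only one table needs to be in memory at a time.
space_units = avg_entry_len * query_len

print(f"{time_units:,} time units, {space_units:,} space units")
```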

9 Two settings Multiple comparisons, short sequences –Parallelization One or more comparisons, very long sequences (e.g., DNA) –Space used to be the critical factor because of limits in the size of direct-access memory –Modern machines can handle the space requirements for almost all comparisons, so time is now the important factor

10 Heuristic alignment O(nm) may still be too much for long sequences Rather than finding the true optimal alignment, follow a heuristic approach that is likely to produce a good alignment –“good” is generally not as good as optimal –sometimes a high-scoring alignment will be missed completely

11 Basis for heuristic alignment What if there were no gaps? –Efficient algorithms exist for aligning –Knuth-Morris-Pratt algorithm, O(n+m) time and space Any good alignment likely has one or more regions with exact matches and no gaps Find such hot spots and proceed from there
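For reference, a textbook version of Knuth-Morris-Pratt exact matching is sketched below. The function names `failure_table` and `kmp_find_all` are chosen here for illustration and are not taken from the slides.

```python
def failure_table(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text: str, pattern: str) -> list[int]:
    """Return the start positions of every exact occurrence of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going to allow overlapping matches
    return hits


print(kmp_find_all("ACGTACGTAC", "ACGTAC"))
```

The text is scanned once and the pattern is preprocessed once, which is where the O(n+m) bound comes from.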

12 FASTA – Step 1 FASTA = Fast Alignment Find all words of length k that exactly match between the two sequences (hot spots) To avoid O(nm) complexity –Construct a hash table for one of the strings, where the keys are the possible words and the values are their starting positions –|Σ|^k such strings, O(kn) complexity –Match the second string in O(km) time
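The hash-table step can be sketched as follows. This is an illustration under stated assumptions, not FASTA's actual implementation: the names `build_word_index` and `find_hot_spots` are hypothetical, and a Python dictionary stands in for the hash table, storing only the words that actually occur.

```python
from collections import defaultdict


def build_word_index(seq: str, k: int) -> dict[str, list[int]]:
    """Map every length-k word of seq to the list of its starting positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_hot_spots(s: str, t: str, k: int) -> list[tuple[int, int]]:
    """Return (i, j) pairs where s[i:i+k] == t[j:j+k]."""
    index = build_word_index(s, k)
    hot_spots = []
    for j in range(len(t) - k + 1):
        # One hash lookup per word of t instead of scanning all of s.
        for i in index.get(t[j:j + k], ()):
            hot_spots.append((i, j))
    return hot_spots


print(find_hot_spots("ACGTAC", "TACG", 2))
```

Note that listing the hot spots takes time proportional to how many there are, on top of the O(km) spent on lookups.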

13 Word size effects As k becomes larger –The algorithm becomes linearly slower –The algorithm takes exponentially more space –It is more difficult to find exact matches, hence –the algorithm becomes more selective and less sensitive Typical values for k are 2 for proteins and 4-6 for DNA
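A small illustration of these trade-offs, assuming alphabets of 20 amino acids and 4 nucleotides. The function names and the two short example sequences are made up for this sketch.

```python
def possible_words(alphabet_size: int, k: int) -> int:
    """Number of distinct length-k words, i.e. the table size |Σ|^k."""
    return alphabet_size ** k


def count_hot_spots(s: str, t: str, k: int) -> int:
    """Count (i, j) pairs with s[i:i+k] == t[j:j+k] using a word index of s."""
    positions = {}
    for i in range(len(s) - k + 1):
        positions.setdefault(s[i:i + k], []).append(i)
    return sum(
        len(positions.get(t[j:j + k], ()))
        for j in range(len(t) - k + 1)
    )


print(possible_words(20, 2), possible_words(4, 6))

# Hypothetical sequences differing in one position.
for k in (2, 3, 4):
    print(k, count_hot_spots("ACGTACGT", "ACGAACGT", k))
```

The table size grows exponentially with k, while the number of hot spots drops as exact matches of longer words become rarer.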

