Where are we going? Remember the extended analogy? – Given binary code, what does the program do? – How does it work? At the end of the semester, I am.


1 Where are we going? Remember the extended analogy? – Given binary code, what does the program do? – How does it work? At the end of the semester, I am going to show you how biologists solved that problem – Binary Code → DNA Code – How the program works → How life works But we are approaching it from the bottom up

2 Bottom-up Design Top Down – See the big picture first – Break it into parts – Analyze each part – Continue breaking down sub-parts into solvable tasks Bottom Up – Identify easily solvable tasks – Use them to solve larger problems – Use the solutions to larger and larger problems to solve the BIG problem and see the big picture

3 Bottom-up Design Top Down – Rethinking the design of existing ideas/inventions – Managing projects that are underway – Works really well in a utopian world Bottom Up – Designing totally new ideas – Putting together projects from scratch – Works really well in the real world

4 Bottom-up Design Top Down – Let’s build an airplane – Let’s build a steering mechanism – Let’s build a lift mechanism – Let’s build a propulsion mechanism Bottom Up – This shape produces lift – A spinning propeller creates propulsion in the air – Canvas with a wood frame is light enough – Perhaps we can build a stable, controllable airplane

5 Bottom-up Design Before we can analyze the big picture, we have to – Look at some of the initial smaller problems – See how they were solved – See how they led to new discoveries

6 Remember Don’t forget to – pick a paper and – email me See the schedule to see what’s taken – http://www.cs.siena.edu/~ebreimer/csis-400-f03/schedule.html

7 Recap 3 different types of comparisons 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)

8 Agenda Overview of Shared Pattern Discovery Edit Distance – How do you compute it – Why it’s not good enough Alignment – Why it’s better – How to compute it

9 Shared Pattern Discovery I have 100 rats that all have green eyes I have 1000 rats that all have blue eyes What exactly do the 100 rats have in common that gives them green eyes?

10 Shared Pattern Discovery A technique called multiple alignment can be used to measure the strength of a genomic pattern found in a set of sequences (a group of rats) – You can identify a subset (rats that have green eyes) and – You can find a sub-region of DNA (a pattern) that the subset shares – But that isn’t shared by any other subset (rats that have blue eyes) Initially, this is how genes were pinpointed

11 Shared Pattern Discovery To understand multiple alignment One needs to understand pair-wise alignment Multiple alignment emerged from the successful application of pair-wise alignment Pair-wise alignment emerged from improvements to traditional string matching algorithms All of this emerged from a need to compare genetic sequences

12 Exact string matching Target: CGTACGAC Pattern: ACGTACGTACGT Problem: The target cannot be found in the pattern even though it’s really close
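A minimal sketch in C of why exact matching reports nothing for this target. It relies only on the standard library's `strstr`; the helper name `contains_exact` is hypothetical and not from the slides.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if `target` occurs verbatim somewhere in `pattern`. */
bool contains_exact(const char *pattern, const char *target)
{
    /* strstr returns NULL when no exact occurrence exists. */
    return strstr(pattern, target) != NULL;
}
```

Called with the slide's pattern `"ACGTACGTACGT"` and target `"CGTACGAC"`, this returns false: exact matching gives a yes/no answer and no measure of how close the target is.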

13 Edit Distance How many edits are needed to exactly match the target with part of the pattern? Target: CGTACGAC Pattern: ACGTACGTACGT Just one
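One common way to compute "edits against part of the pattern" is a dynamic-programming table whose first row is all zeros, so that a match may begin anywhere in the pattern, with the answer read as the smallest value in the last row. The sketch below assumes that approach and counts only insertions and deletions; the function name `best_substring_distance` is hypothetical, and the slides do not say which method the lecture used.

```c
#include <string.h>

/* Smaller of two ints. */
static int min_int(int a, int b) { return a < b ? a : b; }

/* Hypothetical helper: fewest insertions/deletions needed to turn `target`
   into some contiguous part of `pattern`. */
int best_substring_distance(const char *target, const char *pattern)
{
    int n = (int)strlen(target);
    int m = (int)strlen(pattern);
    int d[n + 1][m + 1];               /* C99 variable-length array */

    for (int j = 0; j <= m; j++)
        d[0][j] = 0;                   /* a match may start anywhere in the pattern */
    for (int i = 1; i <= n; i++)
        d[i][0] = i;                   /* i target symbols against nothing */

    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            if (target[i - 1] == pattern[j - 1])
                d[i][j] = d[i - 1][j - 1];
            else
                d[i][j] = min_int(d[i][j - 1], d[i - 1][j]) + 1;

    int best = d[n][0];
    for (int j = 1; j <= m; j++)       /* a match may also end anywhere */
        best = min_int(best, d[n][j]);
    return best;
}
```

For the slide's target and pattern this gives 1: the part `CGTACGTAC` of the pattern differs from the target by a single symbol.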

14 Edit Distance How many edits are needed to exactly match the target with the WHOLE pattern? Target: CGTACGAC Pattern: ACGTACGTACGT Four

15 Edit Distance – Dynamic Programming [Figure: dynamic-programming table; sequence labels as extracted: ACGTCGCAT and A C G T G T G C] Callouts: Optimal edit distance for TG and TCG; optimal edit distance for TG and TCGA; optimal edit distance for TGA and TCG; final answer: optimal edit distance for TGA and TCGA

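16 Edit Distance
    int matrix[n+1][m+1];
    /* Row 0 and column 0: distance from the empty prefix. */
    for (x = 0; x <= n; x++) matrix[x][0] = x;
    for (y = 1; y <= m; y++) matrix[0][y] = y;
    for (x = 1; x <= n; x++)
        for (y = 1; y <= m; y++)
            if (seq1[x-1] == seq2[y-1])      /* strings are 0-indexed */
                matrix[x][y] = matrix[x-1][y-1];
            else                             /* cheaper of insert and delete */
                matrix[x][y] = min(matrix[x][y-1] + 1, matrix[x-1][y] + 1);
    return matrix[n][m];
Filled table as shown on the slide (row 0 and column 0 hold the initial values):
    0 1 2 3 4 5 6 7 8 9
    1 0 1 2 3 4 5 6 7 8
    2 1 2 1 2 3 4 5 6 7
    3 2 3 2 1 2 3 4 5 6
    4 3 2 3 2 1 2 3 4 5
    5 4 3 2 3 2 1 2 3 4
    6 5 4 3 4 3 2 1 2 3
    7 6 5 4 5 4 3 2 3 4
    8 7 6 5 6 5 5 3 2 3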

17 Edit Distance
    int matrix[n+1][m+1];
    /* Row 0 and column 0: distance from the empty prefix. */
    for (x = 0; x <= n; x++) matrix[x][0] = x;
    for (y = 0; y <= m; y++) matrix[0][y] = y;
    for (x = 1; x <= n; x++)
        for (y = 1; y <= m; y++)
            if (seq1[x-1] == seq2[y-1])      /* strings are 0-indexed */
                matrix[x][y] = matrix[x-1][y-1];
            else                             /* cheaper of insert and delete */
                matrix[x][y] = min(matrix[x][y-1] + 1, matrix[x-1][y] + 1);
    return matrix[n][m];
How many times is this comparison performed? How many times is this assignment performed?
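One way to answer the two questions empirically is to instrument the loop. The sketch below is a runnable C version of the slide's routine that takes the minimum of the insert and delete options and uses 0-based string indexing; the function name `edit_distance_counted` and the counter parameter are hypothetical additions for illustration.

```c
#include <string.h>

/* Smaller of two ints. */
static int min_int(int a, int b) { return a < b ? a : b; }

/* Insert/delete-only edit distance between seq1 and seq2.
   *comparisons receives how many symbol comparisons were made. */
int edit_distance_counted(const char *seq1, const char *seq2, long *comparisons)
{
    int n = (int)strlen(seq1);
    int m = (int)strlen(seq2);
    int matrix[n + 1][m + 1];          /* C99 variable-length array */
    *comparisons = 0;

    for (int x = 0; x <= n; x++) matrix[x][0] = x;
    for (int y = 0; y <= m; y++) matrix[0][y] = y;

    for (int x = 1; x <= n; x++)
        for (int y = 1; y <= m; y++) {
            (*comparisons)++;          /* one comparison per inner cell */
            if (seq1[x - 1] == seq2[y - 1])
                matrix[x][y] = matrix[x - 1][y - 1];
            else
                matrix[x][y] = min_int(matrix[x][y - 1], matrix[x - 1][y]) + 1;
        }
    return matrix[n][m];
}
```

For the earlier target `CGTACGAC` (n = 8) and pattern `ACGTACGTACGT` (m = 12), the distance is 4 and the counter ends at 8 × 12 = 96: the comparison and the inner assignment each run once per cell, n·m times in total.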

18 Edit Distance – Dynamic Programming [Figure: dynamic-programming table; sequence labels as extracted: ACGTCGCAT and A C G T G T G C; n = 8] Callouts: To derive the value 7, we need to know that we can match two T’s. In the worst case, this may take n comparisons. To derive the value 6, we need to know that we can match two C’s after matching two T’s. To derive the value 5, we need to know that we can match two G’s after already matching two C’s and previously matching two T’s.

19 Edit Distance – Dynamic Programming [Figure: dynamic-programming table; sequence labels as extracted: ACGTCGCAT and A C G T G T G C] Callouts: Given our previous matches, there is no way we can match two A’s. Thus, the edit distance is increased. Luckily, we can match these two C’s. But now we’ve matched the last symbol. We can’t do any more matching (period!)

20 Lesson to learn There is no way to compute the optimal (minimum) edit distance without considering all possible matching combinations. The only way to do that is to consider all possible sub-problems. This is the reason the entire table must be considered. If you can compute the optimal (minimum) edit distance using fewer than O(nm) computations, then you will be renowned!

21 Why Edit Distance Stinks for Genetic Data DNA evolves in strange ways
…TAGATCCCAGATCAGTATTCAAGTTATAC… (This is a gene in the rat genome)
…GATCTCCCAGATAGAAGCAGTATTCAGTCA… (This is the same gene in the fruit bat)
…CCTATCAGCAGGATCAAGTATGTCATACTAC… (This is a totally unrelated region of the AIDS virus)
The edit distance between rat and virus is smaller than that between rat and fruit bat.

22 Alignment We need a more robust way to measure similarity Alignment meets several requirements 1. It rewards matches 2. It penalizes mismatches 3. It allows for different strategies for penalizing gaps 4. It helps visualize similarity.

23 Alignment Example 1. G A A T T C A G T T A (sequence #1) 2. G G A T C G A (sequence #2) One possible alignment:
G A A T T C A G T T A
G G A _ T C _ G _ _ A
Callouts on the slide: Mismatch; Gap; Gap (size 2)

24 Alignment A simple scoring scheme is used where – $S_{i,j}$ is the score at position i,j – $S_{i,j} = 1$ if the residue at position i of sequence #1 matches the residue at position j of sequence #2 (match score); otherwise – $S_{i,j} = 0$ (mismatch score) – $w$ is the gap penalty, which we will discuss later

25 Alignment Three steps in the dynamic programming algorithm for alignment 1. Initialization 2. Matrix fill (scoring) 3. Traceback (alignment)

26 Alignment Initialization Step The first step creates a matrix with – M + 1 columns and – N + 1 rows – where M and N correspond to the sizes of the sequences to be aligned. The first row and first column of the matrix can be initially filled with 0.

27 Alignment

28 Matrix Fill Step – For each position, $M_{i,j}$ is defined to be the maximum score at position i,j; i.e.
$$M_{i,j} = \max\bigl[\, M_{i-1,j-1} + S_{i,j} \ \text{(match/mismatch)},\ \ M_{i,j-1} + w \ \text{(gap in sequence \#1)},\ \ M_{i-1,j} + w \ \text{(gap in sequence \#2)} \,\bigr]$$
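The initialization and matrix fill steps can be sketched in C as follows. This is an illustration under stated assumptions, not the lecture's own code: the function name `alignment_score` is hypothetical, the match and mismatch scores are the 1 and 0 of the simple scheme, and the gap penalty `w` is passed in as a parameter.

```c
#include <string.h>

/* Largest of three ints. */
static int max3(int a, int b, int c)
{
    int best = a > b ? a : b;
    return best > c ? best : c;
}

/* Fill the score matrix for seq1 (index i) and seq2 (index j)
   and return the bottom-right entry. */
int alignment_score(const char *seq1, const char *seq2, int w)
{
    int n = (int)strlen(seq1);
    int m = (int)strlen(seq2);
    int M[n + 1][m + 1];               /* C99 variable-length array */

    /* Initialization: first row and first column are 0. */
    for (int i = 0; i <= n; i++) M[i][0] = 0;
    for (int j = 0; j <= m; j++) M[0][j] = 0;

    /* Matrix fill: each cell takes the best of its three predecessors. */
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int s = (seq1[i - 1] == seq2[j - 1]) ? 1 : 0;  /* S(i,j) */
            M[i][j] = max3(M[i - 1][j - 1] + s,   /* match/mismatch   */
                           M[i][j - 1] + w,       /* gap in sequence #1 */
                           M[i - 1][j] + w);      /* gap in sequence #2 */
        }
    return M[n][m];
}
```

With the two example sequences `GAATTCAGTTA` and `GGATCGA` and `w = 0`, the bottom-right entry is 6, one point for each of the six residues the sequences can match in order.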

29 Example The score at position 1,1 can be calculated. The first residue in both sequences is a G. Thus, $S_{1,1} = 1$. Thus, $M_{1,1} = \max[\,M_{0,0} + 1,\ M_{1,0} + 0,\ M_{0,1} + 0\,] = \max[1, 0, 0] = 1$.

30 Example

31

32

33

34 Edit Dist. vs. Alignment Scoring Note that the metric used in alignment is different from that of edit distance – Smaller edit distance → more similar – Higher alignment score → more similar Also: Edit distance refers specifically to edits – delete or insert a symbol – discrete value – not flexible

35 Alignment Scoring
$$M_{i,j} = \max\bigl[\, M_{i-1,j-1} + S_{i,j} \ \text{(match/mismatch score)},\ \ M_{i,j-1} + w \ \text{(gap in sequence \#1)},\ \ M_{i-1,j} + w \ \text{(gap in sequence \#2)} \,\bigr]$$
$S_{i,j}$:
        A     C     G     T
    A   1.1   0.0   0.3   0.5
    C         1.3   0.1   0.0
    G               1.0   0.0
    T                     1.2
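A table like this can replace the fixed 1/0 match score with a lookup. The sketch below is one possible C encoding; the names `base_index` and `pair_score` are hypothetical, the slide's half-filled table is mirrored so the score does not depend on argument order (an assumption), and symbols other than A, C, G, T score 0 (also an assumption).

```c
/* Map a nucleotide to a table index, or -1 if it is not A, C, G or T. */
static int base_index(char base)
{
    switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Values from the slide's table, mirrored across the diagonal (assumption). */
static const double S[4][4] = {
    /*         A    C    G    T  */
    /* A */ { 1.1, 0.0, 0.3, 0.5 },
    /* C */ { 0.0, 1.3, 0.1, 0.0 },
    /* G */ { 0.3, 0.1, 1.0, 0.0 },
    /* T */ { 0.5, 0.0, 0.0, 1.2 },
};

/* Score for pairing residue a with residue b. */
double pair_score(char a, char b)
{
    int i = base_index(a);
    int j = base_index(b);
    if (i < 0 || j < 0)
        return 0.0;                    /* assumption: unknown symbols score 0 */
    return S[i][j];
}
```

In the fill step, `pair_score(seq1[i-1], seq2[j-1])` would take the place of the 1-or-0 value, which lets some mismatches (A with T at 0.5, for instance) count for more than others.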

36 Alignment Scoring
$$M_{i,j} = \max\bigl[\, M_{i-1,j-1} + S_{i,j} \ \text{(match/mismatch score)},\ \ M_{i,j-1} + w \ \text{(gap in sequence \#1)},\ \ M_{i-1,j} + w \ \text{(gap in sequence \#2)} \,\bigr]$$
$w = -1$ One possible alignment:
G A A T T C A G T T A
G G A _ T C _ G _ _ A
Callouts on the slide: Gap −1; Gap −2
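To see what the gap penalty does to a finished alignment, the columns can be scored one at a time. This C sketch assumes the simple scheme (match 1, mismatch 0) and charges `w` for every gap symbol, which agrees with the −1 and −2 callouts; the function name `score_alignment` is hypothetical.

```c
#include <string.h>

/* Score two equal-length gapped rows column by column; '_' marks a gap. */
int score_alignment(const char *row1, const char *row2, int w)
{
    int total = 0;
    size_t len = strlen(row1);

    for (size_t k = 0; k < len && row2[k] != '\0'; k++) {
        if (row1[k] == '_' || row2[k] == '_')
            total += w;                /* gap column costs w */
        else if (row1[k] == row2[k])
            total += 1;                /* match */
        /* a mismatch adds 0 */
    }
    return total;
}
```

For the rows `GAATTCAGTTA` and `GGA_TC_G__A` there are six matches, one mismatch, and four gap symbols, so the total is 6 with `w = 0` and 6 − 4 = 2 with `w = -1`.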

37 Alignment Scoring Summary: We have a way of rewarding different types of matches and mismatches We have a separate way of penalizing gaps We could choose not to penalize gaps – if we had a clue that they weren’t harmful

38 Recall DNA evolves in strange ways
…TAGATCCCAGATCAGTATTCAAGTTATAC… (This is a gene in the rat genome)
…GATCTCCCAGATAGAAGCAGTATTCAGTCA… (This is the same gene in the fruit bat)
…CCTATCAGCAGGATCAAGTATGTCATACTAC… (This is a totally unrelated region of the AIDS virus)
The edit distance between rat and virus is smaller than that between rat and fruit bat.

39 Tracing back the alignment
(Seq #1) A
         |
(Seq #2) A

40 Tracing back the alignment
(Seq #1) A
         |
(Seq #2) A

41 Tracing back the alignment
(Seq #1) TA
          |
(Seq #2) _A

42 Tracing back the alignment
(Seq #1) TTA
           |
(Seq #2) __A

43 Tracing back the alignment
(Seq #1) GAATTCAGTTA
         | | || |  |
(Seq #2) GGA_TC_G__A
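The traceback can be sketched in C as a walk from the bottom-right cell back to the top-left, at each cell asking which of the three predecessors produced its value. This is an illustration under assumptions rather than the lecture's code: the function name `align` and its output buffers are hypothetical, scores are match 1 and mismatch 0, and ties are broken in favor of the diagonal, so the alignment it reports may differ from the one on the slide while having the same score.

```c
#include <string.h>

/* Largest of three ints. */
static int max3(int a, int b, int c)
{
    int best = a > b ? a : b;
    return best > c ? best : c;
}

/* Align seq1 and seq2 with gap penalty w. The gapped rows are written to
   out1 and out2 (each needs strlen(seq1) + strlen(seq2) + 1 bytes).
   Returns the score in the bottom-right cell. */
int align(const char *seq1, const char *seq2, int w, char *out1, char *out2)
{
    int n = (int)strlen(seq1), m = (int)strlen(seq2);
    int M[n + 1][m + 1];               /* C99 variable-length array */

    /* Initialization and matrix fill. */
    for (int i = 0; i <= n; i++) M[i][0] = 0;
    for (int j = 0; j <= m; j++) M[0][j] = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int s = (seq1[i - 1] == seq2[j - 1]) ? 1 : 0;
            M[i][j] = max3(M[i - 1][j - 1] + s, M[i][j - 1] + w, M[i - 1][j] + w);
        }

    /* Traceback: build both rows right to left. */
    int i = n, j = m, len = 0;
    while (i > 0 || j > 0) {
        int s = (i > 0 && j > 0 && seq1[i - 1] == seq2[j - 1]) ? 1 : 0;
        if (i > 0 && j > 0 && M[i][j] == M[i - 1][j - 1] + s) {
            out1[len] = seq1[--i];     /* pair two residues */
            out2[len] = seq2[--j];
        } else if (i > 0 && (j == 0 || M[i][j] == M[i - 1][j] + w)) {
            out1[len] = seq1[--i];     /* gap in sequence #2 */
            out2[len] = '_';
        } else {
            out1[len] = '_';           /* gap in sequence #1 */
            out2[len] = seq2[--j];
        }
        len++;
    }

    /* The rows were written backwards; reverse them in place. */
    for (int a = 0, b = len - 1; a < b; a++, b--) {
        char t = out1[a]; out1[a] = out1[b]; out1[b] = t;
        t = out2[a]; out2[a] = out2[b]; out2[b] = t;
    }
    out1[len] = '\0';
    out2[len] = '\0';
    return M[n][m];
}
```

For `GAATTCAGTTA` and `GGATCGA` with `w = 0`, the returned score is 6 and the two output rows contain six matching columns. Because the first row and column are filled with 0, gaps at the very start of the alignment are not charged even when `w` is negative.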

