1 CS 262 Discussion Section 1

2 Purpose of discussion sections To clarify difficulties/ambiguities in the problem set questions and lecture material. To supplement class material by going somewhat into the biological concepts and motivations underlying this field. To discuss more algorithms from a topic, wherever needed.

3 Antiparallel vs Parallel strands

4 The DNA strand has a chemical polarity

5 The members of each base pair can fit together within the double helix only if the two strands of the helix are antiparallel
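
As a small illustration of what antiparallel pairing means for sequence data, the sketch below computes the partner strand of a DNA string, read in its own 5' to 3' direction. The name `reverse_complement` and the example sequence are illustrative choices, not part of the slides.

```python
# Watson-Crick base pairs: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand):
    """Return the partner strand, written 5'->3'.

    Because the two strands are antiparallel, the partner is read in the
    opposite direction, so we complement each base and reverse the order.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("ATGC"))  # GCAT
```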

6 Prokaryotes do not have a nucleus, eukaryotes do

7 Eukaryotic DNA is packaged into chromosomes. A chromosome is a single, enormously long, linear DNA molecule associated with proteins that fold and pack the fine thread of DNA into a more compact structure. Human genome: 3.2 × 10^9 base pairs distributed over 46 chromosomes.

14 A display of the full set of 46 chromosomes

15 Sequence similarity

16 Biological motivation. Sequence similarity is useful in hypothesizing the function of a new sequence, assuming that sequence similarity implies structural and functional similarity. [Diagram: a new sequence is sent as a query to a sequence database; the response is a list of similar matches.]

17 Case Study: Multiple Sclerosis Multiple sclerosis is an autoimmune dysfunction in which the T-cells of the immune system start attacking the body’s own nerve cells. The T-cells recognize the myelin sheath protein of neurons as foreign. Show movie

18 Why does this happen? A hypothesis: possibly, the myelin sheath proteins identified by the T-cells were similar to bacterial/viral sheath proteins from an earlier infection. How to test this hypothesis? Use sequence alignment. [Diagram: the myelin sheath proteins are sent as a query to a sequence database; the response is a list of similar bacterial/viral sequences; lab tests then lead to identification of the cause of the immune dysfunction.]

19 Dynamic Programming. It is a way of solving problems (involving recurrence relations) by storing partial results. Consider the Fibonacci series: F(n) = F(n-1) + F(n-2), with F(0) = 0 and F(1) = 1. A naive recursive algorithm takes exponential time to find F(n), because it recomputes the same subproblems many times. A dynamic programming solution takes only n steps (linear time).
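
The contrast between the two approaches can be sketched as follows. The function names are illustrative; the bottom-up version stores only the last two partial results, which is all the recurrence needs.

```python
def fib_recursive(n):
    """Direct translation of the recurrence.

    Each call spawns two more calls, so the same F(k) values are
    recomputed over and over: exponential time.
    """
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_dp(n):
    """Bottom-up dynamic programming: n steps, storing partial results."""
    prev, curr = 0, 1  # F(0), F(1)
    for _ in range(n):
        prev, curr = curr, prev + curr
    return prev  # after n steps, prev holds F(n)


print(fib_recursive(10), fib_dp(10))  # 55 55
```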

20 Needleman-Wunsch algorithm
F(i, j) = maximum of:
    F(i-1, j-1) + s(x[i], y[j])
    F(i-1, j) - d
    F(i, j-1) - d
[Diagram: cell F(i,j) is reached from F(i-1,j-1) with +s(x[i], y[j]), or from F(i-1,j) or F(i,j-1) with -d.]
Assume that match = 1, mismatch = 0, indel = 0 (that is, d = 0).
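
A minimal sketch of the matrix fill, using the slide's scoring as defaults. The name `nw_matrix` is illustrative, and the boundary values F(i,0) = -i·d and F(0,j) = -j·d are an assumption (the slide does not state them; with d = 0 they are all zero, as in the example that follows).

```python
def nw_matrix(x, y, match=1, mismatch=0, d=0):
    """Fill the Needleman-Wunsch matrix; F[i][j] scores x[1..i] vs y[1..j]."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundaries (assumed): a prefix aligned against nothing is all indels.
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # x[i] paired with y[j]
                F[i - 1][j] - d,      # x[i] paired with a gap
                F[i][j - 1] - d,      # y[j] paired with a gap
            )
    return F


# Rows follow the first argument, columns the second.
F = nw_matrix("GGATCGA", "GAATTCAGTTA")
for row in F:
    print(" ".join(str(v) for v in row))
```

The bottom-right entry `F[-1][-1]` is the optimal global alignment score.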

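21 Needleman-Wunsch example: initialization
    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0
G 0
A 0
T 0
C 0
G 0
A 0

22 First cell filled in
    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1
G 0
A 0
T 0
C 0
G 0
A 0

23 First row filled in
    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0
A 0
T 0
C 0
G 0
A 0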

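24 Completed matrix
    G A A T T C A G T T A
  0 0 0 0 0 0 0 0 0 0 0 0
G 0 1 1 1 1 1 1 1 1 1 1 1
G 0 1 1 1 1 1 1 1 2 2 2 2
A 0 1 2 2 2 2 2 2 2 2 2 3
T 0 1 2 2 3 3 3 3 3 3 3 3
C 0 1 2 2 3 3 4 4 4 4 4 4
G 0 1 2 2 3 3 4 4 5 5 5 5
A 0 1 2 3 3 3 4 5 5 5 5 6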

25 Traceback on the completed matrix, starting from the bottom-right cell.

38 The solution. Optimal alignment has a score of 6.
G_AATTCAGTTA
GGA_T_C_G__A
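
A sketch of fill plus traceback for this example, under the slide's scoring (match = 1, mismatch = 0, indel = 0). The function name and the tie-breaking order (diagonal first, then a gap in the second sequence) are assumptions. Because many alignments tie at score 6, the alignment it returns need not be the one printed on the slide, only one with the same score.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, d=0):
    """Return (score, aligned_x, aligned_y) for a global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    # Traceback: from (n, m), step to any predecessor that explains F[i][j].
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1])
            ay.append("_")
            i -= 1
        else:
            ax.append("_")
            ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(score)
print(ax)
print(ay)
```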

39 Linear Space Alignment Serafim talked about the Myers-Miller algorithm in class. There is another variant of the Hirschberg algorithm, given in Durbin (Pg 35).

40 Suppose we know that characters X[i] and Y[j] are aligned to each other in the optimal alignment of X[1..n] and Y[1..m]. How can we compute the alignment using this information? We can partition the alignment into two parts: align X[1..i-1] with Y[1..j-1], and X[i+1..n] with Y[j+1..m], separately.
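
The partition can be checked numerically: if X[i] and Y[j] really are paired in some optimal alignment, the two sub-problem scores plus the score of that pair add up to the full optimal score. The sketch below assumes the scoring used in the earlier example; `nw_score` and `split_score` are illustrative names, and the pair (i, j) = (6, 5) is the C/C column of the alignment on slide 38.

```python
def nw_score(x, y, match=1, mismatch=0, d=0):
    """Score-only Needleman-Wunsch, keeping one row of F at a time."""
    prev = [-d * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-d * i]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] - d, curr[j - 1] - d))
        prev = curr
    return prev[len(y)]


def split_score(x, y, i, j, match=1, mismatch=0, d=0):
    """Best score when X[i] is forced to pair with Y[j] (1-based indices)."""
    left = nw_score(x[:i - 1], y[:j - 1], match, mismatch, d)   # X[1..i-1] vs Y[1..j-1]
    pair = match if x[i - 1] == y[j - 1] else mismatch          # X[i] vs Y[j]
    right = nw_score(x[i:], y[j:], match, mismatch, d)          # X[i+1..n] vs Y[j+1..m]
    return left + pair + right


x, y = "GAATTCAGTTA", "GGATCGA"
print(nw_score(x, y), split_score(x, y, 6, 5))  # 6 6
```

Forcing a pair that is not on any optimal alignment can only give a score less than or equal to the optimum, which is why the algorithm needs a reliable way to find such a pair.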

41 [Figure: the dynamic programming matrix with the middle column marked and a cell F(i,j) highlighted.]

47 [Figure: cell F(i,j) and the middle column.] This is the cell in the middle column from which the traceback leaves the column. Maintain the coordinates of that cell along with the value of F(i,j). Call it c(i,j).

48 For every cell in the right half of the matrix: Maintain the F(i,j) value. Maintain the coordinates of the cell in the middle column from which its traceback path leaves the middle column. Call it c(i,j). Maintain the direction of that jump as given by the pointer (either diagonal or horizontal, the only two moves that change column). Call it P(i,j).

49 If (i’,j’) is the cell preceding (i,j), from which F(i,j) is derived, then c(i,j) = c(i’,j’) and P(i,j) = P(i’,j’). We need only linear space to compute the F, c and P values as we proceed across the matrix.

50 [Figure: cells F(i’,j’) with c(i’,j’) and F(i,j) with c(i,j), relative to the middle column.] We know the traceback from (i’,j’) leaves the middle column at the cell c(i’,j’). Hence, the traceback from (i,j) leaves the middle column at that same cell, so c(i,j) has the same value. We are interested in the value of c(n,m).

51 We use the c(n,m) and P(n,m) values to split the dynamic programming matrix into two parts. How? Because we know one aligned pair of letters in the optimal alignment now.
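
The whole procedure can be sketched as below. This is a hedged reading of the slides, not the course's reference code. It assumes the first sequence x runs across the columns, so the "middle column" is i = len(x) // 2. It stores only the row coordinate of c, since the column is always the middle one. It labels the two possible jumps "diag" (a pair of letters) and "gap" (x's letter against a gap). All names are illustrative.

```python
def middle_split(x, y, match=1, mismatch=0, d=0):
    """Return (F(n,m), c(n,m), P(n,m)) keeping one column at a time.

    Requires len(x) >= 1. c is the j coordinate in column mid = len(x) // 2.
    """
    n, m = len(x), len(y)
    mid = n // 2
    F = [-d * j for j in range(m + 1)]  # column i = 0
    c = [None] * (m + 1)
    P = [None] * (m + 1)
    for i in range(1, n + 1):
        newF, newc, newP = [F[0] - d], [None], [None]
        if i == mid + 1:                # cell (i, 0) is entered from (mid, 0)
            newc[0], newP[0] = 0, "gap"
        elif i > mid + 1:               # inherit from the preceding cell
            newc[0], newP[0] = c[0], P[0]
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            best, move = max(
                [(F[j - 1] + s, "diag"), (F[j] - d, "gap"), (newF[j - 1] - d, "stay")],
                key=lambda t: t[0],
            )
            newF.append(best)
            if i <= mid:                # left half: nothing to record yet
                newc.append(None)
                newP.append(None)
            elif move == "stay":        # predecessor is in the same column
                newc.append(newc[j - 1])
                newP.append(newP[j - 1])
            elif i == mid + 1:          # this move is the jump out of column mid
                newc.append(j - 1 if move == "diag" else j)
                newP.append(move)
            else:                       # c(i,j) = c(i',j'), P(i,j) = P(i',j')
                src = j - 1 if move == "diag" else j
                newc.append(c[src])
                newP.append(P[src])
        F, c, P = newF, newc, newP
    return F[m], c[m], P[m]


def linear_space_align(x, y, match=1, mismatch=0, d=0):
    """Recursively split at the middle column and join the two halves."""
    if not x:
        return "_" * len(y), y
    _, j, move = middle_split(x, y, match, mismatch, d)
    mid = len(x) // 2
    left_x, left_y = linear_space_align(x[:mid], y[:j], match, mismatch, d)
    if move == "diag":                  # x[mid] is paired with y[j]
        col_y, rest = y[j], y[j + 1:]
    else:                               # x[mid] is paired with a gap
        col_y, rest = "_", y[j:]
    right_x, right_y = linear_space_align(x[mid + 1:], rest, match, mismatch, d)
    return left_x + x[mid] + right_x, left_y + col_y + right_y


print(*linear_space_align("GAATTCAGTTA", "GGATCGA"), sep="\n")
```

When P(n,m) is a diagonal jump, the split hands over exactly the "one aligned pair of letters" that slide 51 refers to. When it is a gap jump, the same split still works, with x's letter placed against a gap instead.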

