CIS786, Lecture 7 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.


1 http://creativecommons.org/licenses/by-sa/2.0/

2 CIS786, Lecture 7 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David La of California State University at Pomona

3 Previously…

4 Sequence alignments They tell us about Function or activity of a new gene/protein Structure or shape of a new protein Location or preferred location of a protein Stability of a gene or protein Origin of a gene or protein Origin or phylogeny of an organelle Origin or phylogeny of an organism And more…

5 Dynamic programming for pairwise alignment Time and space complexity is O(mn) Define V(i,j) to be the optimal pairwise alignment score between S[1..i] and T[1..j] (|S|=m, |T|=n)
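
A minimal sketch of the V(i,j) recurrence, assuming a simple linear scoring scheme. The `match`, `mismatch`, and `gap` values are illustrative assumptions, not values from the lecture. The table has (m+1) x (n+1) cells, which is where the O(mn) time and space come from.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Return V(m, n), the optimal global alignment score of s and t.

    The scoring values are illustrative assumptions, not from the slides.
    """
    m, n = len(s), len(t)
    # V[i][j] = best score for aligning the prefixes s[:i] and t[:j]
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        V[i][0] = i * gap  # s[:i] aligned against gaps only
    for j in range(1, n + 1):
        V[0][j] = j * gap  # t[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + sub,  # align s[i] with t[j]
                V[i - 1][j] + gap,      # s[i] against a gap
                V[i][j - 1] + gap,      # t[j] against a gap
            )
    return V[m][n]


print(global_alignment_score("ACGT", "AGT"))  # 3 matches and 1 gap -> 1
```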

6 Local alignment Finding optimally aligned local regions
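
One common way to find optimally aligned local regions is the Smith-Waterman variant of the same dynamic program: scores are clamped at zero so an alignment can start anywhere, and the answer is the maximum over the whole table. The sketch below assumes the same illustrative scoring values as a simple global aligner would use; they are not from the lecture.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score between s and t (Smith-Waterman)."""
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                0,                      # start a fresh local alignment here
                V[i - 1][j - 1] + sub,
                V[i - 1][j] + gap,
                V[i][j - 1] + gap,
            )
            best = max(best, V[i][j])
    return best


print(local_alignment_score("TTTACGTTT", "GGACGGG"))  # shared region ACG -> 3
```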

7 BLAST Key idea: search for k-mers (short matching substrings) quickly by preprocessing the database.
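
A rough sketch of the preprocessing idea only: index every k-mer of the database once, then look up each k-mer of the query in constant expected time. The function names are hypothetical, and real BLAST does considerably more (scored neighborhood words, seed extension, statistics) that is not shown here.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map each k-mer to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_seeds(query, index, k):
    """Return exact k-mer hits as (query position, sequence id, database position)."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + k], []):
            seeds.append((qpos, seq_id, dpos))
    return seeds


database = ["TAGGCCTT", "ACACTTC"]
index = build_kmer_index(database, k=4)
print(find_seeds("CACTT", index, k=4))  # [(0, 1, 1), (1, 1, 2)]
```

The index is built once per database, so its cost is shared across all queries.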

8 BLAST This key idea can also be used for speeding up pairwise alignments when doing multiple sequence alignments

9 Biologically realistic scoring matrices PAM and BLOSUM are most popular PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins BLOSUM is more recent and computed from blocks of sequences with sufficient similarity

10 Multiple sequence alignment “Two sequences whisper, multiple sequences shout out loud”---Arthur Lesk Computationally very hard---NP-hard

11 Multiple sequence alignment Unaligned sequences GGCTT TAGGCCTT TAGCCCTTA ACACTTC ACTT Aligned sequences _G__GCTT_ TAGGCCTT_ TAGCCCTTA A__CACTTC A__C_CTT_ Conserved regions help us to identify functionality

12 Sum of pairs score
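
The slide's formula did not survive in the transcript, so the following is a sketch of the usual definition: the sum-of-pairs score of a multiple alignment is the sum, over all pairs of rows, of the pairwise score induced by the alignment. The unit costs for mismatches and gaps are assumptions chosen for illustration, and a column where both rows have a gap is assumed to cost nothing.

```python
from itertools import combinations


def sum_of_pairs_cost(alignment, mismatch=1, gap=1):
    """Sum the induced pairwise costs over all pairs of aligned rows."""
    # All rows of an alignment must have the same length.
    assert len({len(row) for row in alignment}) <= 1
    total = 0
    for a, b in combinations(alignment, 2):
        for x, y in zip(a, b):
            if x == y:
                continue  # identical residues, or gap against gap
            total += gap if "_" in (x, y) else mismatch
    return total


print(sum_of_pairs_cost(["AC_", "A_G", "ACG"]))  # pair costs 2 + 1 + 1 = 4
```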

13 Tree Alignment Leaves: TAGGCCTT_ (Human), TAGCCCTTA (Monkey), A__C_CTT_ (Cat), A__CACTTC (Lion), _G__GCTT_ (Mouse) Internal node labels: TAGGCCTT_, A__CACTT_, TGGGGCTT_, AGGGACTT_ Edge costs: 0, 2, 2, 1, 1, 3, 3, 2 Tree alignment score = 14
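
The score can be reproduced by summing, over every edge of the tree, the number of columns in which the two endpoint labels differ (a gap counts as a character). The tree drawing is missing from the transcript, so the edge list below is an assumption: it is a topology reconstructed so that the per-edge costs match the eight values listed on the slide.

```python
def hamming(a, b):
    """Number of aligned columns in which two equal-length labels differ."""
    return sum(x != y for x, y in zip(a, b))


# Assumed topology (not shown in the transcript), as (parent label, child label).
EDGES = [
    ("AGGGACTT_", "TGGGGCTT_"),
    ("AGGGACTT_", "A__CACTT_"),
    ("TGGGGCTT_", "TAGGCCTT_"),
    ("TGGGGCTT_", "_G__GCTT_"),  # Mouse
    ("TAGGCCTT_", "TAGGCCTT_"),  # Human
    ("TAGGCCTT_", "TAGCCCTTA"),  # Monkey
    ("A__CACTT_", "A__C_CTT_"),  # Cat
    ("A__CACTT_", "A__CACTTC"),  # Lion
]


def tree_alignment_score(edges):
    """Sum of edge costs over a tree whose nodes are all labeled."""
    return sum(hamming(u, v) for u, v in edges)


print(tree_alignment_score(EDGES))  # 14
```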

14 Profile alignment

15 Iterative alignment (heuristic for sum-of-pairs) Pick a random sequence from input set S Do (n-1) pairwise alignments and align to closest one t in S Remove t from S and compute profile of alignment While sequences remaining in S –Do |S| pairwise alignments and align to closest one t –Remove t from S

16 Progressive alignment Idea: perform profile alignments in the order dictated by a tree Given a guide-tree do a post-order search and align sequences in that order Widely used heuristic Can be used for solving tree alignment
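
A small sketch of the ordering step only: a post-order walk of the guide tree visits both children before their parent, so each internal node becomes one profile-profile merge. The nested-tuple tree encoding and the function name are assumptions made for illustration; the profile alignment itself is not implemented here.

```python
def merge_order(guide_tree):
    """List the profile merges implied by a post-order walk of a binary guide tree.

    A leaf is a sequence name (str); an internal node is a (left, right) tuple.
    Each step is a pair of name-tuples: the two groups to be profile-aligned.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return (node,)
        left, right = node
        left_group = visit(left)
        right_group = visit(right)
        steps.append((left_group, right_group))  # children are done, merge them
        return left_group + right_group

    visit(guide_tree)
    return steps


tree = ((("Human", "Monkey"), "Mouse"), ("Cat", "Lion"))
for left, right in merge_order(tree):
    print(left, "+", right)
```

For this example tree the merges are Human with Monkey, that pair with Mouse, Cat with Lion, and finally the two resulting groups.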

17 Simultaneous alignment and phylogeny reconstruction Given unaligned sequences produce both alignment and phylogeny Known as the generalized tree alignment problem---MAX-SNP hard Iterative improvement heuristic: –Take starting tree –Modify it using say NNI, SPR, or TBR –Compute tree alignment score –If the score is better, keep the new tree; otherwise try another modification, until a local minimum is reached

18 Median alignment Idea: iterate over the phylogeny and align every triplet of sequences---takes O(m^3) time (in general, for n sequences it takes O(2^n m^n) time) Same profiles can be used as in progressive alignment Produces better tree alignment scores (as observed in experiments) Iteration continues for a specified limit

19 Popular alignment programs ClustalW: most popular, progressive alignment MUSCLE: fast and accurate, progressive and iterative combination T-COFFEE: slow but accurate, consistency based alignment (align sequences in multiple alignment to be close to the optimal pairwise alignment) PROBCONS: slow but highly accurate, probabilistic consistency progressive based scheme DIALIGN: very good for local alignments

20 Evaluation of multiple sequence alignments Compare to benchmark “true” alignments Use simulation Measure conservation of an alignment Measure accuracy of phylogenetic trees How well does it align motifs?

21 BAliBASE Most popular benchmark of alignments Alignments are based upon structure BAliBASE currently consists of 142 reference alignments, containing over 1000 sequences. Of the 200,000 residues in the database, 58% are defined within the core blocks. The remaining 42% are in ambiguous regions that cannot be reliably aligned. The alignments are divided into four hierarchical reference sets, reference 1 providing the basis for construction of the following sets. Each of the main sets may be further sub-divided into smaller groups, according to sequence length and percent similarity.

22 Comparison of alignments on BAliBASE

23 This time Parsimonious aligner Comparison of alignments under simulation Phylogenetic motifs Comparison of alignments for phylogenetic motif detection

24 Parsimonious aligner (PAl) 1. Construct progressive alignment A 2. Construct MP tree T on A 3. Construct progressive alignment A’ on guide-tree T 4. Set A=A’ and go to 2 5. Output alignment and tree with best MP score
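
The loop can be sketched as follows, under our reading that the MP tree is re-estimated on each new alignment. The two callables are hypothetical placeholders for an aligner and an MP tree search (the lecture names MUSCLE and TNT for these roles); their signatures, and the fixed `max_rounds` stopping rule, are assumptions, since the slide gives no termination condition.

```python
def pal(sequences, progressive_align, mp_search, max_rounds=10):
    """Alternate between alignment and MP tree search, keeping the best MP score.

    progressive_align(sequences, guide_tree) -> alignment   (hypothetical signature)
    mp_search(alignment) -> (tree, mp_score)                 (hypothetical signature)
    Returns (mp_score, alignment, tree) for the lowest MP score seen.
    """
    alignment = progressive_align(sequences, None)  # step 1: aligner's default guide tree
    best = None
    for _ in range(max_rounds):
        tree, score = mp_search(alignment)           # step 2: MP tree on current alignment
        if best is None or score < best[0]:          # lower parsimony score is better
            best = (score, alignment, tree)
        alignment = progressive_align(sequences, tree)  # steps 3-4: realign on the MP tree
    return best
```

Passing the tools in as functions keeps the sketch independent of any particular command-line interface.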

25 PAl Faster than iterative improvement Speed and accuracy both depend upon progressive alignment and MP heuristic In practice MUSCLE and TNT are used for constructing alignments and MP trees How does PAl compare against traditional methods? PAl not designed for aligning structural regions but focuses on evolutionarily conserved regions Let’s look at performance under simulation

26 Evaluating alignments under simulation We first need a way to evolve sequences with insertions and deletions NOTE: evolutionary models we have encountered so far do not account for insertions and deletions Not known exactly how to model insertions and deletions

27 ROSE Evolve sequences under an i.i.d. Markov Model Root sequence: probabilities given by a probability vector (for proteins default is Dayhoff et al. values) Substitutions –Edge lengths are integers –Probability matrix M is given as input (default is PAM1*) –For edge of length b probability of x → y is given by (M^b)_xy Insertions and deletions: –Insertions and deletions follow the same probabilistic model –For each edge probability to insert is i_ins –Length of insertion is given by discrete probability distribution (normally exponential) –For edge of length b this is repeated b times. Model tree can be specified as input
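
To make the substitution rule concrete: for an edge of integer length b, the probability of x changing to y is the (x, y) entry of the b-th matrix power of M. The sketch below uses a made-up 2-state matrix purely for illustration; it is not PAM1 and not ROSE's code.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, b):
    """Return M raised to the non-negative integer power b."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(b):
        result = mat_mul(result, M)
    return result


# Toy one-step substitution matrix over two states (illustrative values only).
M = [[0.9, 0.1],
     [0.2, 0.8]]
P = mat_pow(M, 2)  # substitution probabilities along an edge of length 2
print(P[0][1])     # 0.9*0.1 + 0.1*0.8, approximately 0.17
```

Each row of the result still sums to 1, as it must for a probability matrix.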

28 Evaluation of alignments Let’s simulate alignments and phylogenies and compare them under simulation!!

29 Parameters for simulation study Model trees: uniform random distribution and uniformly selected random edge lengths Model of evolution: PAM with insertion and deletion probabilities selected from a gamma distribution (see ROSE software package) Replicate settings: Settings of 50, 100, and 400 taxa, mean sequence lengths of 200 and 500 and average branch lengths of 10, 25, and 50 were selected. For each setting 10 datasets were produced
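
As a quick check on the size of the study, the settings can be enumerated. This assumes every combination of the three parameters was used, which the slide does not state explicitly.

```python
from itertools import product

TAXA = (50, 100, 400)
MEAN_SEQUENCE_LENGTHS = (200, 500)
AVG_BRANCH_LENGTHS = (10, 25, 50)
REPLICATES_PER_SETTING = 10

# Assumed full factorial design: one setting per parameter combination.
settings = list(product(TAXA, MEAN_SEQUENCE_LENGTHS, AVG_BRANCH_LENGTHS))
total_datasets = len(settings) * REPLICATES_PER_SETTING

print(len(settings), total_datasets)  # 18 180
```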

30 Phylogeny accuracy

31 Alignment accuracy

32 Running time

33 Conclusions DIALIGN seems to perform best followed by PAl, MUSCLE, and PROBCONS DIALIGN, however, is slower than PAl Does this mean DIALIGN is the best alignment program?

34 Conclusions DIALIGN seems to perform best followed by PAl, MUSCLE, and PROBCONS DIALIGN, however, is slower than PAl Does this mean DIALIGN is the best alignment program? Not necessarily: experiments were performed under uniform random trees with uniform random edge lengths. Not clear if this emulates the real deal.

35 Conclusions

36 Sum-of-pairs vs MP score

37

38 Conclusions Optimizing MP scores under this simulation model leads to better phylogenies and alignments

39 Conclusions Optimizing MP scores under this simulation model leads to better phylogenies and alignments What other models can we try?

40 Conclusions Optimizing MP scores under this simulation model leads to better phylogenies and alignments What other models can we try? Real data phylogenies as model trees Birth-death model trees Other distributions for model trees… Branch lengths: similar issues… Evolutionary model parameters estimated from real data

41 Evaluating alignments using motif detection Let’s evaluate alignments by searching for motifs If alignment X reveals more functional motifs than Y using technique Z then X is better than Y w.r.t. Z Motifs could be functional sites in proteins or functional regions in non-coding DNA

42 Protein Functional Site Prediction The identification of protein regions responsible for stability and function is an especially important post-genomic problem With the explosion of genomic data from recent sequencing efforts, protein functional site prediction from only sequence is an increasingly important bioinformatic endeavor.

43 What is a “Functional Site”? Defining what constitutes a “functional site” is not trivial Residues that include and cluster around known functionality are clear candidates for functional sites We define a functional site as catalytic residues, binding sites, and the regions that cluster around them.

44 Protein

45 Protein + Ligand

46 Functional Sites (FS)

47 Regions that Cluster Around FS

48 Phylogenetic motifs PMs are short sequence fragments that conserve the overall familial phylogeny Are they functional? How do we detect them?

49 Phylogenetic motifs PMs are short sequence fragments that conserve the overall familial phylogeny Are they functional? How do we detect them? First we design a simple heuristic to find them Then we see if the detected sites are functional

50 Scan for Similar Trees Whole Tree

51 Scan for Similar Trees Whole Tree

52 Scan for Similar Trees Windowed Tree Whole Tree

53 Scan for Similar Trees Partition Metric Score: 6 Windowed Tree Whole Tree

54 Scan for Similar Trees Partition Metric Score: 8 Windowed Tree Whole Tree

55 Scan for Similar Trees Partition Metric Score: 4 Windowed Tree Whole Tree

56 Scan for Similar Trees Partition Metric Score: 6 Windowed Tree Whole Tree

57 Scan for Similar Trees Partition Metric Score: 8 Windowed Tree Whole Tree

58 Scan for Similar Trees Partition Metric Score: 6 Windowed Tree Whole Tree

59 Scan for Similar Trees Partition Metric Score: 6 Windowed Tree Whole Tree

60 Scan for Similar Trees Partition Metric Score: 0 Windowed Tree Whole Tree

61 Scan for Similar Trees Partition Metric Score: 6 Windowed Tree Whole Tree

62 Scan for Similar Trees Partition Metric Score: 6 Windowed Tree Whole Tree

63 Scan for Similar Trees Partition Metric Score: 8 Windowed Tree Whole Tree

64 Scan for Similar Trees Partition Metric Score: 0 Windowed Tree Whole Tree

65 Scan for Similar Trees Partition Metric Score: 6 Windowed Tree Whole Tree

66 Scan for Similar Trees Partition Metric Score: 6 Windowed Tree Whole Tree

67 Scan for Similar Trees Partition Metric Score: 6 Windowed Tree Whole Tree
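
The partition metric used in the scan counts the bipartitions (splits) induced by internal edges that appear in one tree but not the other, so identical topologies score 0. Below is a sketch for trees over the same taxa; the nested-tuple encoding is an assumption for illustration, and the tree-building step for each window is not shown.

```python
def leaf_set(tree):
    """All leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions, each stored as the side without a fixed reference taxon."""
    taxa = leaf_set(tree)
    ref = min(taxa)  # arbitrary but consistent reference taxon
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = taxa - clade if ref in clade else clade
        if 1 < len(side) < len(taxa) - 1:  # skip trivial splits
            found.add(side)
        return clade

    visit(tree)
    return found


def partition_metric(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


whole = (("A", "B"), ("C", ("D", "E")))
windowed = (("A", "C"), ("B", ("D", "E")))
print(partition_metric(whole, windowed))  # 2
```

Normalizing each split to the side without the reference taxon makes the result independent of where the tuple encoding happens to be rooted.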

68 Phylogenetic Motif Identification Compare all windowed trees with whole tree and keep track of the partition metric scores Normalize all partition metric scores by calculating z-scores Call these normalized scores Phylogenetic Similarity Z-scores (PSZ) Set a PSZ threshold for identifying windows that represent phylogenetic motifs
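
A minimal sketch of the normalization and thresholding steps, using the fifteen partition metric scores shown in the window scan as sample input. Two assumptions are made here: a lower partition metric means a more similar tree, so motif windows are those with a PSZ at or below the threshold, and the threshold of -1.5 is a placeholder rather than a value from the lecture.

```python
from statistics import mean, pstdev


def psz_scores(pm_scores):
    """Convert partition metric scores to z-scores (PSZ)."""
    mu = mean(pm_scores)
    sigma = pstdev(pm_scores)
    if sigma == 0:
        return [0.0] * len(pm_scores)  # all windows equally similar
    return [(score - mu) / sigma for score in pm_scores]


def motif_windows(pm_scores, threshold):
    """Indices of windows whose PSZ is at or below the threshold."""
    return [i for i, z in enumerate(psz_scores(pm_scores)) if z <= threshold]


# Partition metric scores from the sliding-window slides, in order.
scores = [6, 8, 4, 6, 8, 6, 6, 0, 6, 6, 8, 0, 6, 6, 6]
print(motif_windows(scores, threshold=-1.5))  # [7, 11], the two windows scoring 0
```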

69 Set PSZ Threshold

70 Regions of PMs

71 Map PMs to the Structure

72 Set PSZ Threshold

73 Map PMs to the Structure Map Set PSZ Threshold

74 Map PMs to the Structure Map Set PSZ Threshold

75 PMs in Various Structures

76 PMs and Traditional Motifs

77 TIM Phylogenetic Similarity False Positive Expectation

78 TIM Phylogenetic Similarity False Positive Expectation

79 TIM Phylogenetic Similarity False Positive Expectation

80 TIM Phylogenetic Similarity False Positive Expectation

81 Cytochrome P450 Phylogenetic Similarity False Positive Expectation

82 Cytochrome P450 Phylogenetic Similarity False Positive Expectation

83 Enolase Phylogenetic Similarity False Positive Expectation

84 Glycerol Kinase Phylogenetic Similarity False Positive Expectation

85 Glycerol Kinase Phylogenetic Similarity False Positive Expectation

86 Myoglobin Phylogenetic Similarity False Positive Expectation

87 Myoglobin Phylogenetic Similarity False Positive Expectation

88 Evaluating alignments For a given alignment compute the PMs Determine the number of functional PMs Those identifying more functional PMs will be classified as better alignments
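
One way to turn this comparison into a number, sketched under a stated assumption: a PM is counted as functional if its window overlaps at least one annotated functional-site position. The overlap rule, the half-open window convention, and the example positions are all hypothetical; the lecture does not specify them.

```python
def count_functional_pms(pm_windows, functional_sites):
    """Count PM windows (start, end), end exclusive, that contain a functional site."""
    sites = set(functional_sites)
    return sum(
        any(pos in sites for pos in range(start, end))
        for start, end in pm_windows
    )


# Hypothetical PM windows from two alignments of the same protein family.
sites = [12, 93, 200]
pms_from_x = [(10, 15), (40, 45), (90, 95)]
pms_from_y = [(40, 45), (198, 203)]
print(count_functional_pms(pms_from_x, sites))  # 2
print(count_functional_pms(pms_from_y, sites))  # 1
```

Under this measure alignment X would be ranked above alignment Y for this family.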

89 Protein datasets

90 Running time

91 Functional PMs PAl=blue MUSCLE=red Both=green (a)=enolase, (b)=ammonia channel, (c)=tri-isomerase, (d)=permease, (e)=cytochrome

