
Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.


1 Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas

2 2 Motif consensus The consensus is the true underlying motif, which is expressed imperfectly in real genes because of mutations across organisms. A motif instance is a particular realization of the motif consensus in a given gene; it differs from the consensus in a small number of positions.

3 3 Motif data example (made up) Motif instances: –AAAAACAC –CAAAACAA –ACACAAAA –CAAAAAAC –AAAGAACA –GACAAAAA –AAGAGAAA Motif consensus: AAAAAAAA
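
As a rough illustration of the made-up example above, the sketch below recovers a consensus by taking the most frequent base in each column and then counts how many positions each instance differs from it. It assumes the instances have already been located and lined up, which is exactly the part that real motif finding has to solve; the helper names `consensus` and `hamming` are illustrative and not from the slides.

```python
from collections import Counter

# The seven made-up motif instances from the slide.
instances = [
    "AAAAACAC", "CAAAACAA", "ACACAAAA", "CAAAAAAC",
    "AAAGAACA", "GACAAAAA", "AAGAGAAA",
]

def consensus(seqs):
    """Most frequent base in each column of equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

motif = consensus(instances)
mismatches = [hamming(motif, s) for s in instances]
print(motif, mismatches)
```

A column-wise majority vote is only one possible estimate of the consensus; it is used here because it is the simplest one that fits the slide's data.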

4 4 Motif data example (real) Positions 3-9 (out of about 22) of the cyclic AMP receptor protein transcription factor binding site in 20 samples: –TTGTGGC –TTTTGAT –AAGTGTC –ATTTGCA –CTGTGAG –ATGCAAA –GTGTTAA –ATTTGAA –TTGTGAT –ATTTATT –ACGTGAT –ATGTGAG –TTGTGAG –CTGTAAC –CTGTGAA –TTGTGAC –GCCTGAC –TTGTGAT –GTGTGAA

5 5 Phylogenetic footprinting A phylogenetic tree organizes related (orthologous) sequences from different species The sequences appear as leaves Internal nodes indicate evolutionary divergence between species A footprint is a highly conserved region across species

6 6 Identifying footprints Main assumption: Functional DNA changes more slowly than other DNA Therefore, closely related regions in different species are –more likely to be functional sequences –a basis for grouping species together Footprints are DNA motifs

7 7 Phylogenetic footprinting example AGTCGTACGTGAC... (Human) AGTAGACGTGCCG... (Chimp) ACGTGAGATACAG... (Rabbit) GAACGGAGTACTG... (Mouse) TCGTGACGGTGAT... (Rat)

9 9 Phylogenetic footprinting example AGTCGTACGTGAC... (Human) AGTAGACGTGCCG... (Chimp) ACGTGAGATACAG... (Rabbit) GAACGGAGTACTG... (Mouse) TCGTGACGGTGAT... (Rat) ACGT

10 10 Phylogenetic footprinting example AGTCGTACGTGAC... (Human) AGTAGACGTGCCG... (Chimp) ACGTGAGATACAG... (Rabbit) GAACGGAGTACTG... (Mouse) TCGTGACGGTGAT... (Rat) ACGT ACGG

11 11 Phylogenetic footprinting example AGTCGTACGTGAC... (Human) AGTAGACGTGCCG... (Chimp) ACGTGAGATACAG... (Rabbit) GAACGGAGTACTG... (Mouse) TCGTGACGGTGAT... (Rat) ACGT ACGG ACG[TG]

12 12 Phylogenetic footprinting example AGTCGTACGTGAC... (Human) AGTAGACGTGCCG... (Chimp) ACGTGAGATACAG... (Rabbit) GAACGGAGTACTG... (Mouse) TCGTGACGGTGAT... (Rat) ACGT ACGG ACGT T→G mutation ACGT
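
The footprint in this example can be located mechanically once the degenerate pattern ACG[TG] is known. The sketch below is only a lookup under that assumption (it does not discover the pattern); it uses the five sequence prefixes shown on the slides and reports where each species carries the conserved region.

```python
import re

# Sequence prefixes exactly as shown on the slides.
sequences = {
    "Human":  "AGTCGTACGTGAC",
    "Chimp":  "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACAG",
    "Mouse":  "GAACGGAGTACTG",
    "Rat":    "TCGTGACGGTGAT",
}

# ACG followed by T or G, the degenerate footprint from the slide.
pattern = re.compile("ACG[TG]")

footprints = {}
for species, seq in sequences.items():
    match = pattern.search(seq)  # leftmost occurrence, if any
    if match:
        footprints[species] = (match.start(), match.group())

for species, (pos, site) in footprints.items():
    print(f"{species:7s} offset {pos}: {site}")
```

The primate and rabbit sequences carry ACGT while mouse and rat carry ACGG, which is consistent with the single T→G mutation marked on the tree.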

13 13 Finding motifs Start with a number of related genes (or proteins) In regulatory motif finding, –the related genes are co-expressed Recall our discussion of DNA micro-arrays

14 14 Finding motifs: Start. The red part is a gene, the green line is the associated regulatory region in non-coding DNA, and the yellow boxes are the motif instances (unknown).

15 15 Finding motifs: Goal. The red part is a gene, the green line is the associated regulatory region in non-coding DNA, and the yellow boxes are the motif instances (unknown).

16 16 How does this relate to what we have discussed before? Motif finding is a clear instance of a data mining problem. Motif finding is equivalent to local alignment across multiple sequences. Typically hundreds of sequences are aligned, sometimes thousands. There are also corresponding biological problems for global alignment of multiple sequences.

17 17 Multiple sequence alignment Protein families –Sets of proteins with similar structure (3D shape), function, or evolutionary history –Usually the above properties are correlated –Given several families, where to assign a new protein? DNA repeating sequences –ALU sequence in humans (300bp, appears more than 1 million times – 10% of our DNA) –Estimated 60% of the “junk” in human genome consists of such sequences

18 18 Optimal alignment We define the multiple global alignment as an extension of strings $S_1, S_2, \ldots, S_k$ to $S'_1, S'_2, \ldots, S'_k$ that may contain spaces, with – $|S'_1| = |S'_2| = \cdots = |S'_k|$ – Removing all spaces from each $S'_i$ leaves $S_i$ – No position has a space in all $S'_i$ We need to extend our similarity function to handle multiple strings. The optimal alignment is the one that maximizes the similarity function.
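
The three conditions in this definition can be checked directly. The sketch below is a minimal validity test, assuming `-` is the space symbol and that the aligned rows are given in the same order as the original strings; the function name `is_valid_alignment` is illustrative.

```python
def is_valid_alignment(originals, aligned, gap="-"):
    """Check the three conditions of a multiple global alignment."""
    if len(originals) != len(aligned):
        return False
    # 1. All extended strings have the same length.
    if len({len(row) for row in aligned}) != 1:
        return False
    # 2. Removing the spaces from each extended string leaves the original.
    if any(row.replace(gap, "") != s for s, row in zip(originals, aligned)):
        return False
    # 3. No column consists of spaces only.
    return all(any(ch != gap for ch in col) for col in zip(*aligned))

strings = ["ACGT", "AGT", "ACT"]
print(is_valid_alignment(strings, ["ACGT", "A-GT", "AC-T"]))     # valid
print(is_valid_alignment(strings, ["ACGT-", "A-GT-", "AC-T-"]))  # all-space column
```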

19 19 Multiple string similarity There are many ways to extend the similarity function to multiple strings. Most common: the sum of pairwise similarities, $\sum_{i<j} \sigma(S'_i, S'_j)$. This assumes a symmetric similarity. We need to account for $\sigma(-,-)$ (usually 0). Alternatively, we can use distances between strings and minimize the sum of the pairwise distances.
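
A sum-of-pairs score can be computed column by column over every pair of rows. The sketch below assumes an illustrative scoring scheme that is not given on the slide (match +1, mismatch -1, character against space -1) together with the slide's convention that a space against a space scores 0.

```python
from itertools import combinations

def sigma(x, y, gap="-"):
    """Symmetric per-column similarity; the numeric values are assumptions."""
    if x == gap and y == gap:
        return 0            # space against space, as on the slide
    if x == gap or y == gap:
        return -1           # character against space
    return 1 if x == y else -1

def sum_of_pairs(aligned):
    """Sum the pairwise similarity over all pairs of aligned rows."""
    return sum(
        sigma(x, y)
        for a, b in combinations(aligned, 2)
        for x, y in zip(a, b)
    )

print(sum_of_pairs(["ACGT", "A-GT", "AC-T"]))
```

Because `sigma` is symmetric, each unordered pair of rows is counted once, which is why `combinations` is used rather than a double loop over all ordered pairs.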

20 20 Dynamic programming for multiple sequence alignment In pairwise alignment, we used a two-dimensional matrix to record three choices at each cell: {01}, {10}, and {11}, where 1 means consume a character from the corresponding string
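
For reference, the pairwise case can be sketched as follows. Each cell takes the best of the three choices {11}, {10}, and {01}; the scoring values (match +1, mismatch -1, space -1) are assumptions chosen for illustration.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of s and t by dynamic programming."""
    n, m = len(s), len(t)
    # V[i][j] = best score for aligning s[:i] with t[:j].
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * gap
    for j in range(1, m + 1):
        V[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                # {11}: consume a character from both strings
                V[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch),
                # {10}: consume from s only (space in t)
                V[i - 1][j] + gap,
                # {01}: consume from t only (space in s)
                V[i][j - 1] + gap,
            )
    return V[n][m]

print(global_alignment_score("ACGT", "AGT"))
```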

21 21 DP for multiple alignment For k strings we need a k-dimensional table. Each dimension has as many elements as the length of the corresponding string plus one (for gaps at the start). Assuming the same length n, the matrix has $(n+1)^k$ cells. At each cell, we consider $2^k - 1$ choices
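
The counts on this slide can be reproduced by enumeration. In the sketch below a choice is a 0/1 vector with one entry per string (1 means consume a character), and the all-zero vector is excluded because it consumes nothing; `moves` and `table_cells` are illustrative names.

```python
from itertools import product

def moves(k):
    """All 2^k - 1 choices at a cell: every 0/1 vector except all zeros."""
    return [m for m in product((0, 1), repeat=k) if any(m)]

def table_cells(lengths):
    """Number of cells in the DP table: product of (length + 1)."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells

print(moves(2))                # the three pairwise choices
print(len(moves(5)))           # 2^5 - 1
print(table_cells([3, 3, 3]))  # (3 + 1)^3
```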

22 22 Multiple alignment complexity $(n+1)^k = O(n^k)$ entries need to be filled, each in $O(2^k)$ time. Total time $O(n^k 2^k) = O((2n)^k)$. Total space $O(n^k)$. Typically n is a few thousand and k a few hundred, making this approach impractical. Independently of whether DP is used, for the sum of pairwise similarities the problem is provably NP-complete
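
To get a feel for these bounds, here is a worked instance with the illustrative values n = 1000 and k = 100, which are assumptions at the low end of the sizes the slide calls typical.

```latex
\[
(2n)^k = (2 \cdot 10^{3})^{100} = 2^{100} \cdot 10^{300}
       \approx 1.27 \times 10^{30} \cdot 10^{300}
       = 1.27 \times 10^{330},
\qquad
n^k = (10^{3})^{100} = 10^{300}.
\]
```

Both the number of operations and the number of table cells are far beyond anything that can be computed or stored.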

23 23 What to do for NP-complete problems? Use exact methods (such as DP) for small inputs only Use approximate methods with polynomial time and a provable error bound Use heuristic approaches that follow plausible choices but have no guaranteed error bound –specific to the problem (such as FASTA) –general (optimization, estimation via statistical sampling such as MCMC)

24 24 Center star algorithm for multiple sequence global alignment T is the set of strings that we want to align. Pick $S \in T$ that minimizes the sum of its pairwise distances to all other strings in T. The initial alignment starts with S ($\equiv S_1$). Suppose we have already aligned $S_1, S_2, \ldots, S_i$ as $S'_1, S'_2, \ldots, S'_i$. Then we add the remaining strings one at a time by aligning $S_{i+1}$ with $S'_1$, obtaining $S'_{i+1}$ and $S''_1$. We replace $S'_1$ with $S''_1$ and add spaces to $S'_2, \ldots, S'_i$ wherever spaces were added to $S'_1$.
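
The steps above can be sketched as follows. This is a minimal version under stated assumptions, not a reference implementation: the distance is unit-cost edit distance, `-` is the space symbol, and deleting a column where the center already holds a space is treated as free. The names `align_pair` and `center_star` are illustrative.

```python
GAP = "-"

def gap_cost(c):
    # Assumption: a space opposite an existing space in the center costs 0.
    return 0 if c == GAP else 1

def align_pair(a, b):
    """Align center row a (may contain spaces) with new string b.

    Returns (distance, columns); each column is (char_a, char_b, new_gap),
    where new_gap marks a space newly inserted into a.
    """
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + gap_cost(a[i - 1])
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match or mismatch
                D[i - 1][j] + gap_cost(a[i - 1]),          # space in b
                D[i][j - 1] + 1,                           # new space in a
            )
    cols, i, j = [], n, m
    while i > 0 or j > 0:  # trace back one optimal path
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            cols.append((a[i - 1], b[j - 1], False)); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap_cost(a[i - 1]):
            cols.append((a[i - 1], GAP, False)); i -= 1
        else:
            cols.append((GAP, b[j - 1], True)); j -= 1
    cols.reverse()
    return D[n][m], cols

def center_star(strings):
    """Return aligned rows: the center first, then the others in input order."""
    def total_distance(c):
        return sum(align_pair(strings[c], t)[0]
                   for j, t in enumerate(strings) if j != c)
    center = min(range(len(strings)), key=total_distance)
    rows = [strings[center]]  # rows[0] plays the role of S'_1
    for j, s in enumerate(strings):
        if j == center:
            continue
        _, cols = align_pair(rows[0], s)
        new_center = "".join(x for x, _, _ in cols)
        new_row = "".join(y for _, y, _ in cols)
        updated = []
        for row in rows[1:]:
            chars = iter(row)
            # Copy old columns; insert a space wherever the center gained one.
            updated.append("".join(GAP if new else next(chars) for _, _, new in cols))
        rows = [new_center] + updated + [new_row]
    return rows

for row in center_star(["ACGT", "AGT", "ACT", "ACGGT"]):
    print(row)
```

The center is chosen with k(k-1) pairwise alignments, after which each remaining string needs only one more alignment against the current center row, so the whole procedure stays polynomial in both n and k.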

