# Random Projection Approach to Motif Finding Adapted from RandomProjections.ppt.

Random Projection Approach to Motif Finding. Adapted from http://genome.ucsd.edu/classes/be202/ppt/FindingSignals-RandomProjections.ppt

daf-19 Binding Sites in C. elegans (Peter Swoboda):

- GTTGTCATGGTGAC
- GTTTCCATGGAAAC
- GCTACCATGGCAAC
- GTTACCATAGTAAC
- GTTTCCATGGTAAC

Genes labeled on the slide: che-2, daf-19, osm-1, osm-6, F02D8.3 (position marker: -150).

Algorithmic Techniques:

- MEME (expectation maximization)
- GibbsDNA (Gibbs sampling)
- CONSENSUS (greedy multiple alignment)
- WINNOWER (clique finding in graphs)
- SP-STAR (sum-of-pairs scoring)
- MITRA (mismatch trees to prune the exhaustive search space)

The (l,d) Planted Motif Problem (Sagot 1998, Pevzner & Sze 2000): Generate a random consensus sequence C of length l. Generate N = 20 instances, each differing from C by d random mutations. Plant one instance at a random position in each of N = 20 random sequences of length n = 600. Can you find the planted instances?
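The problem statement above can be turned into a small generator. This is a minimal sketch, assuming a uniform background over A, C, G, T and that a planted instance overwrites l letters of the background (the slide does not say whether instances overwrite or are inserted). The names `mutate` and `planted_motif_instance` are illustrative only.

```python
import random

ALPHABET = "ACGT"

def mutate(consensus, d, rng):
    """Return a copy of consensus with exactly d distinct positions changed."""
    chars = list(consensus)
    for pos in rng.sample(range(len(chars)), d):
        # Pick a letter different from the current one so the position really mutates.
        chars[pos] = rng.choice([c for c in ALPHABET if c != chars[pos]])
    return "".join(chars)

def planted_motif_instance(l, d, N=20, n=600, seed=0):
    """Build N random sequences of length n, each holding one mutated copy of a random l-mer."""
    rng = random.Random(seed)
    consensus = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, positions = [], []
    for _ in range(N):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        # Assumption: the instance overwrites the background at a random offset.
        background[start:start + l] = mutate(consensus, d, rng)
        sequences.append("".join(background))
        positions.append(start)
    return consensus, sequences, positions

consensus, sequences, positions = planted_motif_instance(l=15, d=4)
print(consensus, positions[:3])
```

Returning the planted positions alongside the sequences makes it possible to check a motif finder's predictions afterwards.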

Planted Motifs:

- AGTTATCGCGGCACAGGCTCCTTCTTTATAGCC
- ATGATAGCATCAACCTAACCCTAGATATGGGAT
- TTTTGGGATATATCGCCCCTACACTGGATGACT
- GGATATACATGAACACGGTGGGAAAACCCTGAC

Each planted instance differs from ACAGGATCA by 2 mutations; the remaining sequence is random.

Random Projection Algorithm (Buhler and Tompa, 2001): Guiding principle: some instances of a motif agree on a subset of positions. Use information from multiple motif instances to construct a model. Example with a (7,2) motif M = ATGCGTC, whose instances occur in input sequences x(1), x(2), x(5), and x(8): ...ccATCCGACca..., ...ttATGAGGCtc..., ...ctATAAGTCgc..., ...tcATGTGACac...

k-Projections: Choose k positions in a string of length l. Concatenate the nucleotides at the chosen k positions to form a k-tuple. In l-dimensional Hamming space, this is a projection onto a k-dimensional subspace. Example with l = 15 and k = 7: the projection P = (2, 4, 5, 7, 11, 12, 13) maps ATGGCATTCAGATTC to TGCTGAT.
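A k-projection is just a selection of letters, as the following sketch shows. The function name `project` is illustrative; positions are 1-based to match the slide's example.

```python
def project(lmer, positions):
    """Concatenate the letters of lmer at the given 1-based positions."""
    return "".join(lmer[p - 1] for p in positions)

# The slide's example: l = 15, k = 7.
print(project("ATGGCATTCAGATTC", (2, 4, 5, 7, 11, 12, 13)))  # TGCTGAT
```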

Random Projection Algorithm: Choose a projection by selecting k positions uniformly at random. For each l-tuple in the input sequences, hash it into a bucket based on the letters at the k selected positions. Recover the motif from a bucket containing multiple l-tuples. Example: in input sequence x(i) = …TCAATGCACCTAT..., the l-tuple TGCACCT hashes to bucket TGCT.

Example: l = 7 (motif size), k = 4 (projection size), projection (1,2,5,7). In the input sequence ...TAGACATCCGACTTGCCTTACTAC..., the l-tuple ATCCGAC hashes to bucket ATGC and the l-tuple GCCTTAC hashes to bucket GCTC.
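The hashing step can be sketched as follows, assuming the bucket label is simply the projected k-tuple itself. `random_projection` and `hash_lmers` are hypothetical helper names, and positions are 0-based here, so the slide's projection (1,2,5,7) becomes [0, 1, 4, 6].

```python
import random
from collections import defaultdict

def random_projection(l, k, rng=random):
    """Choose k of the l positions uniformly at random (0-based, sorted)."""
    return sorted(rng.sample(range(l), k))

def hash_lmers(sequences, l, positions):
    """Map each bucket label h(x) to the (sequence index, offset) pairs of its l-mers."""
    buckets = defaultdict(list)
    for i, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            label = "".join(seq[start + p] for p in positions)
            buckets[label].append((i, start))
    return buckets

seq = "TAGACATCCGACTTGCCTTACTAC"
buckets = hash_lmers([seq], l=7, positions=[0, 1, 4, 6])
print([seq[s:s + 7] for _, s in buckets["ATGC"]])  # ['ATCCGAC']
print([seq[s:s + 7] for _, s in buckets["GCTC"]])  # ['GCCTTAC']
```

Storing offsets rather than the l-mers themselves keeps the link back to the input sequences, which a later refinement step may need.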

Hashing and Buckets: The hash function h(x) is obtained from the k positions of the projection. Buckets are labeled by the values of h(x). Enriched buckets contain at least s l-tuples, for some parameter s. (Slide figure: sequence fragment ATTCCATCGCTC, bucket label ATGC.)
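Filtering for enriched buckets is a threshold on bucket size. A minimal sketch, assuming the l-tuples have already been extracted from the input; `enriched_buckets` is an illustrative name and positions are 0-based.

```python
from collections import defaultdict

def enriched_buckets(lmers, positions, s):
    """Group l-mers by their projection and keep buckets holding at least s of them."""
    buckets = defaultdict(list)
    for x in lmers:
        buckets["".join(x[p] for p in positions)].append(x)
    return {label: members for label, members in buckets.items() if len(members) >= s}

lmers = ["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC", "GCCTTAC"]
# Four l-mers share the label ATGC; GCCTTAC sits alone in GCTC and is dropped.
print(enriched_buckets(lmers, positions=[0, 1, 4, 6], s=3))
```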

Motif Refinement: How do we recover the motif from the sequences in the enriched buckets? The k nucleotides at the projected positions are known from the hash value of the bucket. Use the information in the other l-k positions as the starting point for a local refinement scheme, e.g. EM or a Gibbs sampler. Example: bucket ATGC holds ATCCGAC, ATGAGGC, ATAAGTC, and ATGTGAC; the local refinement algorithm returns the candidate motif ATGCGTC.

Frequency Matrix Model from Bucket: The l-tuples in bucket ATGC (ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC) give a frequency matrix W; the EM algorithm turns W into a refined matrix W*.
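Building W from a bucket amounts to counting letters per column. The sketch below represents W as a list of per-position dictionaries; the `pseudocount` parameter is an assumption added here so that no entry is zero when W is later used in likelihood ratios, and is not taken from the slide.

```python
def frequency_matrix(lmers, alphabet="ACGT", pseudocount=0.0):
    """Return W as a list of dicts: W[j][a] is the frequency of letter a at position j."""
    W = []
    for j in range(len(lmers[0])):
        counts = {a: pseudocount for a in alphabet}
        for x in lmers:
            counts[x[j]] += 1
        total = sum(counts.values())
        W.append({a: counts[a] / total for a in alphabet})
    return W

bucket = ["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC"]
W = frequency_matrix(bucket)
print(W[0]["A"], W[2]["G"])  # 1.0 0.5
```

Columns at the projected positions (1, 2, 5, 7) are unanimous by construction; the remaining columns carry the information that refinement starts from.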

Motif Finding as Global Optimization: Motif finding optimizes a scoring function (Hamming distance, likelihood ratio, etc.). Many existing algorithms (MEME, GibbsDNA) are good local optimization routines. Random projection is a procedure for finding good starting points.

EM Motif Refinement: For each bucket h containing at least s sequences, form a weight matrix $W_h$. Use the EM algorithm with starting point $W_h$ to obtain a refined weight matrix model $W_h^*$. For each input sequence x(i), return the l-tuple y(i) that maximizes the likelihood ratio $\Pr(y(i) \mid W_h^*) / \Pr(y(i) \mid P_0)$. Then T = {y(1), y(2), …, y(N)} and C(T) is the consensus string of T.
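Picking y(i) and the consensus C(T) can be sketched as below. It assumes positions are independent, that W is a list of per-position dictionaries with strictly positive entries, and that P0 is a dictionary of background letter probabilities. All function names are illustrative.

```python
import math

def log_likelihood_ratio(y, W, P0):
    """log(Pr(y | W) / Pr(y | P0)) for an l-mer y; W must have no zero entries."""
    return sum(math.log(W[j][a] / P0[a]) for j, a in enumerate(y))

def best_lmer(seq, W, P0):
    """Return the l-mer of seq with the highest likelihood ratio under W versus P0."""
    l = len(W)
    return max((seq[i:i + l] for i in range(len(seq) - l + 1)),
               key=lambda y: log_likelihood_ratio(y, W, P0))

def consensus(lmers, alphabet="ACGT"):
    """Column-wise most frequent letter (ties go to the earlier letter in alphabet)."""
    return "".join(max(alphabet, key=lambda a: sum(x[j] == a for x in lmers))
                   for j in range(len(lmers[0])))

P0 = {a: 0.25 for a in "ACGT"}
W = [{a: (0.7 if a == m else 0.1) for a in "ACGT"} for m in "ACG"]
T = [best_lmer(seq, W, P0) for seq in ["TTACGTT", "GGACGGG", "CTCGCCC"]]
print(T, consensus(T))
```

Working in log space avoids underflow when l is large; the maximizer is the same as for the plain ratio.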

Expectation Maximization (EM): S = {x(1), …, x(N)} is the set of input sequences. Given: W, an initial probabilistic motif model, and $P_0$, the background probability distribution. Find the value $W_{max}$ that maximizes the likelihood ratio of S under the motif model relative to the background $P_0$. EM is a local optimization scheme and requires a starting value W.
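To make the role of the starting value concrete, here is a minimal EM sketch. It assumes exactly one motif occurrence per sequence, a fixed background P0, and a pseudocount to keep probabilities positive; none of these modelling choices are stated on the slide, and MEME's actual models are richer. The name `em_refine` is hypothetical.

```python
def em_refine(sequences, W, P0, iterations=20, pseudocount=0.1):
    """Refine W by EM, assuming one motif occurrence per sequence.

    W is a list of dicts with strictly positive entries; P0 maps letters to
    background probabilities.
    """
    alphabet = sorted(P0)
    l = len(W)
    for _ in range(iterations):
        counts = [{a: pseudocount for a in alphabet} for _ in range(l)]
        for seq in sequences:
            # E-step: weight of each start position is its likelihood ratio.
            weights = []
            for start in range(len(seq) - l + 1):
                ratio = 1.0
                for j in range(l):
                    a = seq[start + j]
                    ratio *= W[j][a] / P0[a]
                weights.append(ratio)
            total = sum(weights)
            # Accumulate expected letter counts under the normalized weights.
            for start, w in enumerate(weights):
                for j in range(l):
                    counts[j][seq[start + j]] += w / total
        # M-step: re-estimate each column from the expected counts.
        W = [{a: c[a] / sum(c.values()) for a in alphabet} for c in counts]
    return W

P0 = {a: 0.25 for a in "ACGT"}
start = [{a: (0.85 if a == m else 0.05) for a in "ACGT"} for m in "ACGT"]
seqs = ["TTACGTTT", "CCACGTCC", "GGACGTGG", "AAACGTAA"]
refined = em_refine(seqs, start, P0)
print("".join(max(col, key=col.get) for col in refined))
```

Because EM only climbs to a nearby optimum, a poor starting W can converge to a spurious model, which is why the quality of the starting point matters.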

A Single Iteration: Choose a random k-projection. Hash each l-mer x in the input sequences into the bucket labeled by h(x). From each bucket B with at least s sequences, form a weight matrix model and perform EM/Gibbs sampler refinement. The candidate motif is the best one found from refinement of all enriched buckets.
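The steps of one iteration fit together roughly as in this sketch. The refinement and scoring routines are passed in as the hypothetical callbacks `refine` and `score`, since the slides leave their details to EM or Gibbs sampling. The stand-ins in the usage lines (a column-wise consensus and a count of exact occurrences) are deliberately crude placeholders, not the method from the slides.

```python
import random
from collections import defaultdict

def single_iteration(sequences, l, k, s, refine, score, rng=random):
    """One pass: random k-projection, hash all l-mers, refine each enriched bucket."""
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for seq in sequences:
        for start in range(len(seq) - l + 1):
            x = seq[start:start + l]
            buckets["".join(x[p] for p in positions)].append(x)
    best, best_score = None, None
    for members in buckets.values():
        if len(members) >= s:  # enriched bucket
            candidate = refine(members)
            candidate_score = score(candidate)
            if best is None or candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best  # None when no bucket reached the threshold s

def column_consensus(lmers):
    """Stand-in for EM/Gibbs refinement: most frequent letter per column."""
    return "".join(max("ACGT", key=lambda a: sum(x[j] == a for x in lmers))
                   for j in range(len(lmers[0])))

seqs = ["TTACGTTT", "CCACGTCC", "GGACGTGG"]
count_hits = lambda motif: sum(motif in seq for seq in seqs)  # stand-in score
print(single_iteration(seqs, l=4, k=4, s=3, refine=column_consensus, score=count_hits))
```

With k equal to l, as in the usage lines, the projection keeps every position, so the run is deterministic; real runs use k < l - d and repeat the iteration with fresh projections.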

What is the best motif? Compute a score S for each motif: generate W, an initial PSSM, from the returned l-mers {y(1), y(2), …, y(N)}. Return the motif with maximal score.

Parameter Selection:

- Projection size k: choose k small so that several motif instances hash to the same bucket (k < l - d). Choose k large to avoid contamination by spurious l-mers, i.e. so that the expected number of random l-mers per bucket stays below a chosen bound E: $E > N(n - l + 1) / 4^k$.
- Bucket threshold s: s = 3 or s = 4.
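The two constraints on k can be checked numerically. This sketch assumes a uniform background over four letters and takes the smallest k that satisfies both conditions, which is one reasonable reading of the slide rather than a rule it states. `expected_bucket_load` and `choose_k` are illustrative names.

```python
def expected_bucket_load(N, n, l, k):
    """Expected number of background l-mers per bucket: N(n - l + 1) / 4^k."""
    return N * (n - l + 1) / 4 ** k

def choose_k(N, n, l, d, E=1.0):
    """Smallest k with k < l - d whose expected bucket load is below E, else None."""
    for k in range(1, l - d):
        if expected_bucket_load(N, n, l, k) < E:
            return k
    return None

# (15,4) problem with N = 20 sequences of length n = 600:
# 20 * 586 = 11720 l-mers; 4^6 = 4096 is too few buckets, 4^7 = 16384 is enough.
print(choose_k(N=20, n=600, l=15, d=4))  # 7
```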

How Many Iterations? Planted bucket: the bucket with hash value h(M), where M is the motif. Choose m, the number of iterations, such that Pr(planted bucket contains ≥ s sequences in at least one of m iterations) ≥ 0.95. The probability is readily computable since the iterations form a sequence of independent Bernoulli trials.
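One way to carry out this computation is sketched below. It assumes that each of t motif instances carries exactly d mutations, that an instance lands in the planted bucket when none of the k projected positions is mutated, and that instances behave independently. These modelling assumptions and all function names are illustrative rather than taken from the slide.

```python
from math import ceil, comb, log

def planted_hit_probability(l, d, k):
    """Chance that none of the k projected positions falls on one of the d mutations."""
    return comb(l - d, k) / comb(l, k)

def prob_fewer_than(s, t, p):
    """Pr(Binomial(t, p) < s): fewer than s of t instances reach the planted bucket."""
    return sum(comb(t, i) * p ** i * (1 - p) ** (t - i) for i in range(s))

def iterations_needed(l, d, k, s, t, q=0.95):
    """Smallest m with Pr(planted bucket has >= s instances in some iteration) >= q."""
    fail = prob_fewer_than(s, t, planted_hit_probability(l, d, k))
    if not 0.0 < fail < 1.0:
        raise ValueError("single-iteration failure probability must lie strictly between 0 and 1")
    return ceil(log(1 - q) / log(fail))

print(iterations_needed(l=15, d=4, k=7, s=4, t=20))
```

The formula follows from independence: all m iterations fail with probability fail^m, so m must satisfy fail^m ≤ 1 - q.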

Examples: K = set of nucleotides in the motif instances. P = set of nucleotides in the positions predicted by the algorithm.
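The slide defines K and P without showing the measure built from them. A common choice, assumed here, is the performance coefficient |K ∩ P| / |K ∪ P| of Pevzner and Sze. The sketch represents each nucleotide as a (sequence index, position) pair and takes one start offset per sequence; the function name is illustrative.

```python
def performance_coefficient(known_starts, predicted_starts, l):
    """|K ∩ P| / |K ∪ P| over nucleotide positions, one l-mer per sequence."""
    K, P = set(), set()
    for i, start in enumerate(known_starts):
        K.update((i, start + j) for j in range(l))
    for i, start in enumerate(predicted_starts):
        P.update((i, start + j) for j in range(l))
    return len(K & P) / len(K | P)

# A prediction shifted by one position in a single sequence, l = 4:
# 3 shared nucleotides out of 5 covered in total.
print(performance_coefficient([0], [1], l=4))  # 0.6
```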