# From Pairwise to Multiple Alignment. What's today? Multiple Sequence Alignment (CLUSTAL), motif search.


From Pairwise to Multiple Alignment

What's today? Multiple Sequence Alignment (CLUSTAL), motif search

Multiple Sequence Alignment (MSA)

    VTISCTGSSSNIGAG-NHVKWYQQLPG
    VTISCTGTSSNIGS--ITVNWYQQLPG
    LRLSCSSSGFIFSS--YAMYWVRQAPG
    LSLTCTVSGTSFDD--YYSTWVRQPPG
    PEVTCVVVDVSHEDPQVKFNWYVDG--
    ATLVCLISDFYPGA--VTVAWKADS--
    AALGCLVKDYFPEP--VTVSWNSG---
    VSLTCLVKGFYPSD--IAVEWWSNG--

Like pairwise alignment, BUT compare n sequences instead of 2.

- Rows represent individual sequences.
- Columns represent the 'same' position.
- Gaps are allowed in all sequences.

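How to find the best MSA

Each column is scored by the fraction of sequences that share its most common residue: 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0. The alignment score is the sum of the column scores.

With gaps:

    GTCGTAGTCG-GC-TCGAC
    GTC-TAG-CGAGCGT-GAT
    GC-GAAG-AG-GCG-AG-C
    GCCGTCG-CG-TCGTA-AC

4 × 1 + 11 × 0.75 + 2 × 0.5, Score = 13.25

Without gaps:

    GTCGTAGTCGGCTCGAC
    GTCTAGCGAGCGTGAT
    GCGAAGAGGCGAGC
    GCCGTCGCGTCGTAAC

1 × 1 + 2 × 0.75 + 11 × 0.5, Score = 8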
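The column scoring rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and treating a residue that occurs once as "no agreement" (score 0) is an assumption that generalizes the slide's "1/4 = 0" rule.

```python
from collections import Counter

def column_score(column: str) -> float:
    """Fraction of sequences sharing the most common residue; gaps never count."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    best = Counter(residues).most_common(1)[0][1]
    # Assumption: a residue seen only once is no agreement (the slide's 1/4 = 0).
    return best / len(column) if best >= 2 else 0.0

def alignment_score(rows) -> float:
    """Sum of the column scores over all columns of an alignment."""
    return sum(column_score("".join(col)) for col in zip(*rows))

gapped = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]
print(alignment_score(gapped))  # 13.25
```

The gapped alignment has 4 columns scoring 1, 11 columns scoring 0.75 and 2 columns scoring 0.5, which reproduces the 13.25 on the slide.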

Alignment of 3 sequences: Complexity: length A × length B × length C. Aligning 100 proteins, 1000 amino acids each: Complexity: 1000^100 = 10^300 table cells. Calculation time: beyond the big bang!

Feasible Approach

Progressive alignment (Feng & Doolittle), based on pairwise alignment scores:

- Build an n by n table of pairwise scores.
- Align similar sequences first.
  - After alignment, consider the aligned sequences as a single sequence.
  - Continue aligning with further sequences.

- For n sequences, there are n × (n-1)/2 pairs.

      GTCGTAGTCG-GC-TCGAC
      GTC-TAG-CGAGCGT-GAT
      GC-GAAG-AG-GCG-AG-C
      GCCGTCG-CG-TCGTA-AC
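A minimal sketch of the pairwise score table follows. It assumes percent identity as the pairwise score and, to stay short, reads the pairs off the already aligned rows shown here instead of running a real pairwise alignment for each pair. `identity` and `first_pair` are hypothetical names.

```python
from itertools import combinations

ROWS = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]

def identity(a: str, b: str) -> float:
    """Fraction of columns with the same residue, skipping gap-gap columns."""
    columns = [(x, y) for x, y in zip(a, b) if (x, y) != ("-", "-")]
    return sum(x == y for x, y in columns) / len(columns)

n = len(ROWS)
# One entry per unordered pair: n * (n - 1) / 2 entries in total.
scores = {(i, j): identity(ROWS[i], ROWS[j]) for i, j in combinations(range(n), 2)}

# Progressive alignment starts from the most similar pair.
first_pair = max(scores, key=scores.get)
print(len(scores), first_pair)
```

For the four sequences here the table has 4 × 3 / 2 = 6 entries.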

    1 GTCGTAGTCG-GC-TCGAC
    2 GTC-TAG-CGAGCGT-GAT
    3 GC-GAAGAGGCG-AGC
    4 GCCGTCGCGTCGTAAC

    1 GTCGTA-GTCG-GC-TCGAC
    2 GTC-TA-G-CGAGCGT-GAT
    3 G-C-GAAGA-G-GCG-AG-C
    4 G-CCGTCGC-G-TCGTAA-C

CLUSTAL method (Higgins and Sharp, 1988)

- Reference: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244.
- An approximation strategy (heuristic algorithm): it yields a possible alignment, but not necessarily the best one.
- Applies progressive sequence alignment.

Treating Gaps in CLUSTAL

- There is a penalty for opening a gap and an additional penalty for extending it.
- Gaps found in the initial alignment remain fixed.
- New gaps are introduced as more sequences are added (with a decreased penalty if a gap already exists).
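The open-plus-extend scheme can be illustrated with a small function. The numeric defaults below are placeholders chosen for the example, not CLUSTAL's actual settings, and charging the extension penalty from the second gap position onward is one common convention among several.

```python
def affine_gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 0.5) -> float:
    """Penalty for one gap of the given length under an open + extend scheme.

    Assumption: the first gap position pays gap_open, and each further
    position pays gap_extend. The default values are illustrative only.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)

for length in (1, 2, 5):
    print(length, affine_gap_penalty(length))
```

With these placeholder values one gap of length 5 costs 12.0, while five separate gaps of length 1 cost 50.0, so the scheme favors a few long gaps over many short ones.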

Other MSA Approaches

- Progressive approach: CLUSTALW, PILEUP, T-COFFEE
- Iterative approach: repeatedly realign subsets of sequences. MultAlin, DiAlign, MUSCLE (progressive alignment followed by iterative refinement)
- Statistical methods: Hidden Markov Models (only for proteins). SAM2K
- Genetic algorithm: SAGA

Links to commonly used MSA tools

- CLUSTALW (recommended for DNA/RNA): http://www.ebi.ac.uk/Tools/clustalw2/
- T-COFFEE: http://www.ebi.ac.uk/t-coffee/
- MUSCLE (recommended for proteins): http://www.ebi.ac.uk/muscle/
- MAFFT: http://www.ebi.ac.uk/mafft/
- Kalign: http://www.ebi.ac.uk/kalign/

CAUTION !!! Different tools may give different results

Motifs

Motifs represent a short common sequence:

- Regulatory motifs (TF binding sites)
- Functional sites in proteins (DNA binding motif)

DNA Regulatory Motifs

Transcription factors (TFs) bind to regulatory motifs.

- TF binding motifs are usually 6 to 20 nucleotides long.
- They are usually located near the target gene, mostly upstream of the transcription start site.

[Figure: SBF and MCM1 bound to the SBF motif and the MCM1 motif, upstream of the transcription start site of Gene X]

Identification of Known Motifs within Genomic Sequences

Main motivation: identifying the targets of regulatory proteins (e.g. transcription factors) in the cell.

In many cancers, transcription factors are known to be mutated. We want to know which genes are the targets of these proteins.

P53 the guardian of the cell

How can we start looking for p53 (or any other transcription factor) targets using bioinformatics?

- Scenario 1: Binding motif is known (easier case)
- Scenario 2: Binding motif is unknown (hard case)

Challenges

- How to recognize a regulatory motif?
- Can we identify new occurrences of known motifs in genome sequences?
- Can we discover new motifs within upstream sequences of genes?

Scenario 1 : Binding motif is known

The E. coli promoter

1. Motif Representation

- Exact motif: CGGATATA
- Consensus: represents only the deterministic nucleotides.
  - Example: HAP1 binding sites in 5 sequences.

        CGGATATACCGG
        CGGTGATAGCGG
        CGGTACTAACGG
        CGGCGGTAACGG
        CGGCCCTAACGG
        ------------
        CGGNNNTANCGG

  - Consensus motif: CGGNNNTANCGG, where N stands for any nucleotide.

Representing only the consensus loses information. How can this be avoided?
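As an illustration, the consensus of the five HAP1 sites can be derived column by column. This sketch assumes the strict rule used on the slide: a position keeps its nucleotide only if all sites agree, otherwise it becomes N. The function name is hypothetical.

```python
def consensus(sites) -> str:
    """Column-wise consensus: the shared nucleotide, or N where the sites disagree."""
    return "".join(col[0] if len(set(col)) == 1 else "N" for col in zip(*sites))

hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]
print(consensus(hap1_sites))  # CGGNNNTANCGG
```

Position 9 shows the information loss: A appears in three of the five sites, but the consensus records only N.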

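Representing the motif as a profile

The promoter has two conserved elements upstream of the transcription start site: TTGACA at -35 and TATAAT at -10. The profiles below are based on ~450 known promoters.

-35 element (TTGACA):

| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| A | 0.1 | 0.1 | 0.1 | 0.5 | 0.2 | 0.5 |
| T | 0.7 | 0.7 | 0.2 | 0.2 | 0.2 | 0.2 |
| G | 0.1 | 0.1 | 0.5 | 0.1 | 0.1 | 0.2 |
| C | 0.1 | 0.1 | 0.2 | 0.2 | 0.5 | 0.1 |

-10 element (TATAAT):

| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| A | 0.1 | 0.7 | 0.2 | 0.6 | 0.5 | 0.1 |
| T | 0.7 | 0.1 | 0.5 | 0.2 | 0.2 | 0.8 |
| G | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.0 |
| C | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 |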

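PSPM: Position Specific Probability Matrix

- Represents a motif of length k (here k = 5).
- Count the number of occurrences of each nucleotide in each position:

| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| A | 10 | 25 | 5 | 70 | 60 |
| C | 30 | 25 | 80 | 10 | 15 |
| T | 50 | 25 | 5 | 10 | 5 |
| G | 10 | 25 | 10 | 10 | 20 |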

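PSPM: Position Specific Probability Matrix

| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| A | 0.1 | 0.25 | 0.05 | 0.7 | 0.6 |
| C | 0.3 | 0.25 | 0.8 | 0.1 | 0.15 |
| T | 0.5 | 0.25 | 0.05 | 0.1 | 0.05 |
| G | 0.1 | 0.25 | 0.1 | 0.1 | 0.2 |

- Defines $P_i(n)$ for $n \in \{A, C, G, T\}$ and $i = 1, \dots, k$.
- $P_i(A)$ is the frequency of nucleotide A in position i.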
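Going from counts to the PSPM is a per-column normalization, sketched below. Two G-row entries (positions 1 and 4) are incomplete in the transcript; they are filled in here as 10 on the assumption that every column sums to 100. `to_pspm` is a hypothetical helper name.

```python
# Count matrix from the slide; G at positions 1 and 4 is inferred (assumption).
counts = {
    "A": [10, 25, 5, 70, 60],
    "C": [30, 25, 80, 10, 15],
    "T": [50, 25, 5, 10, 5],
    "G": [10, 25, 10, 10, 20],
}

def to_pspm(counts):
    """Divide each count by its column total so every column sums to 1."""
    k = len(next(iter(counts.values())))
    totals = [sum(counts[n][i] for n in counts) for i in range(k)]
    return {n: [counts[n][i] / totals[i] for i in range(k)] for n in counts}

pspm = to_pspm(counts)
for nucleotide, row in pspm.items():
    print(nucleotide, row)
```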

PSPM: Position Specific Probability Matrix

Each k-mer is assigned a probability: the product of the matrix entries for its nucleotides, one per position.

- Example: P(TCCAG) = 0.5 × 0.25 × 0.8 × 0.7 × 0.2 = 0.014

Detecting a Known Motif within a Sequence using PSPM

The PSPM is moved along the query sequence. At each position, the sub-sequence is scored for a match to the PSPM.

Example: sequence = ATGCAAGTCT…

- Position 1: ATGCA, 0.1 × 0.25 × 0.1 × 0.1 × 0.6 = $1.5 \times 10^{-4}$
- Position 2: TGCAA, 0.5 × 0.25 × 0.8 × 0.7 × 0.6 = 0.042
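The sliding-window scan might look like the sketch below. The matrix is the 5-position PSPM from the slides, with the G entry at position 4 assumed to be 0.1 so that the column sums to 1. `scan` is a hypothetical function name.

```python
from math import prod

# PSPM from the slides (G at position 4 assumed to be 0.1).
PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def scan(sequence: str, pspm=PSPM):
    """Return (1-based position, window, probability) for every full-length window."""
    k = len(pspm["A"])
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        probability = prod(pspm[n][i] for i, n in enumerate(window))
        hits.append((start + 1, window, probability))
    return hits

for position, window, probability in scan("ATGCAAGTCT"):
    print(position, window, f"{probability:.2e}")
```

The first two windows give 1.5 × 10⁻⁴ for ATGCA and 0.042 for TGCAA, matching the slide.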

Detecting a Known Motif within a Sequence using PSSM

Is it a random match, or is it indeed an occurrence of the motif?

PSPM -> PSSM (Position Specific Scoring Matrix)

- Odds score: $O_i(n)$, where $n \in \{A, C, G, T\}$ and $i = 1, \dots, k$.
- Defined as $O_i(n) = P_i(n) / P(n)$, where $P(n)$ is the background frequency of nucleotide n.
- As $O_i(n)$ increases, so do the odds that n at position i is part of a real motif.

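PSSM as Odds Score Matrix

Assumption: the background frequency of each nucleotide is 0.25. The three steps, shown for row A:

1. Original PSPM ($P_i$)
2. Odds matrix ($O_i$)
3. Going to log scale gives an additive score: the log-odds matrix ($\log_2 O_i$)

| Position | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| $P_i(A)$ | 0.1 | 0.25 | 0.05 | 0.7 | 0.6 |
| $O_i(A)$ | 0.4 | 1 | 0.2 | 2.8 | 2.4 |
| $\log_2 O_i(A)$ | -1.322 | 0 | -2.322 | 1.485 | 1.263 |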

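Calculating using Log Odds Matrix

| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| A | -1.32 | 0 | -2.32 | 1.48 | 1.26 |
| C | 0.26 | 0 | 1.68 | -1.32 | -0.74 |
| T | 1 | 0 | -2.32 | -1.32 | -2.32 |
| G | -1.32 | 0 | -1.32 | -1.32 | -0.32 |

A log-odds score around 0 implies a random match; a score > 0 implies a real match (?).

Example: sequence = ATGCAAGTCT…

- Position 1: ATGCA, -1.32 + 0 - 1.32 - 1.32 + 1.26 = -2.7, odds = $2^{-2.7}$ = 0.15
- Position 2: TGCAA, 1 + 0 + 1.68 + 1.48 + 1.26 = 5.42, odds = $2^{5.42}$ = 42.8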
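A short sketch of the log-odds calculation, assuming the uniform background of 0.25 stated on the slides and the PSPM used earlier (G at position 4 assumed to be 0.1). The function names are illustrative.

```python
from math import log2

# PSPM from the slides (G at position 4 assumed to be 0.1).
PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}
BACKGROUND = 0.25  # uniform background frequency, as assumed on the slide

def log_odds_matrix(pspm, background=BACKGROUND):
    """log2 of P_i(n) / P(n) for every nucleotide and position."""
    return {n: [log2(p / background) for p in row] for n, row in pspm.items()}

def window_score(window: str, pssm) -> float:
    """Additive log-odds score of one window."""
    return sum(pssm[n][i] for i, n in enumerate(window))

PSSM = log_odds_matrix(PSPM)
for window in ("ATGCA", "TGCAA"):
    score = window_score(window, PSSM)
    print(window, round(score, 2), round(2 ** score, 2))
```

Because this works with unrounded matrix entries, the results differ slightly from the slide: TGCAA comes out at odds of about 43.0 instead of 42.8, since the slide sums entries already rounded to two decimals.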

Calculating the probability of a match

For the sequence ATGCAAG, the odds score S of each window is:

- Position 1: ATGCA = 0.15
- Position 2: TGCAA = 42.8
- Position 3: GCAAG = 0.18

Dividing each score by the sum over all positions gives a probability: $P(i) = S_i / \sum_j S_j$.

Example: P(1) = 0.15 / (0.15 + 42.8 + 0.18) = 0.003

- P(1) = 0.003
- P(2) = 0.992
- P(3) = 0.004
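The normalization step is a one-liner; the sketch below uses the three odds scores quoted on the slides (0.15, 42.8 and 0.18). `match_probabilities` is a hypothetical name.

```python
def match_probabilities(odds):
    """Divide each window's odds score by the sum over all windows."""
    total = sum(odds)
    return [s / total for s in odds]

odds = [0.15, 42.8, 0.18]  # positions 1, 2, 3 of ATGCAAG
probabilities = match_probabilities(odds)
print([round(p, 3) for p in probabilities])  # [0.003, 0.992, 0.004]
```

This treats the motif as occurring exactly once somewhere in the scanned sequence, which is what makes the values sum to 1.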

Building a PSSM for short motifs

1. Collect all known sequences that bind a certain TF.
2. Align all sequences (using multiple sequence alignment).
3. Compute the frequency of each nucleotide in each position (PSPM).
4. Incorporate the background frequency of each nucleotide (PSSM).

Graphical Representation: Sequence Logo

- Horizontal axis: position of the base in the sequence.
- Vertical axis: amount of information (bits).
- Letter stack: order indicates importance.
- Letter height: indicates frequency.
- The consensus can be read across the top of the letter columns.
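The "bits" on the vertical axis can be made concrete with the standard information-content formula for DNA, 2 bits minus the Shannon entropy of the column. The sketch below is a simplification: logo tools usually also apply a small-sample correction, which is left out here, and the function names are illustrative.

```python
from math import log2

def information_content(column) -> float:
    """log2(alphabet size) minus the Shannon entropy of the column, in bits."""
    entropy = -sum(p * log2(p) for p in column if p > 0)
    return log2(len(column)) - entropy

def letter_heights(column):
    """Height of each letter: its frequency times the column's information content."""
    total = information_content(column)
    return [p * total for p in column]

# Columns 2 and 3 of the 5-position PSPM from the slides, in the order A, C, T, G.
uniform_column = [0.25, 0.25, 0.25, 0.25]
conserved_column = [0.05, 0.8, 0.05, 0.1]
print(information_content(uniform_column))
print(round(information_content(conserved_column), 3))
print(letter_heights(conserved_column))
```

A uniform column carries 0 bits and disappears from the logo, while the column dominated by C carries close to 1 bit, most of it shown as the height of the C.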

http://weblogo.berkeley.edu WebLogo - Input

WebLogo - Output

[Figure: example sequence logos for genes and for proteins]

Scenario 2 : Binding motif is unknown

Finding new Motifs We are given a group of genes, which presumably contain a common regulatory motif. We know nothing of the TF that binds to the putative motif. The problem: discover the motif.

Motif Discovery

Example Predicting the cAMP Receptor Protein (CRP) binding site motif

Extract experimentally defined CRP Binding Sites

    GGATAACAATTTCACA
    AGTGTGTGAGCGGATAACAA
    AAGGTGTGAGTTAGCTCACTCCCC
    TGTGATCTCTGTTACATAG
    ACGTGCGAGGATGAGAACACA
    ATGTGTGTGCTCGGTTTAGTTCACC
    TGTGACACAGTGCAAACGCG
    CCTGACGGAGTTCACA
    AATTGTGAGTGTCTATAATCACG
    ATCGATTTGGAATATCCATCACA
    TGCAAAGGACGTCACGATTTGGG
    AGCTGGCGACCTGGGTCATG
    TGTGATGTGTATCGAACCGTGT
    ATTTATTTGAACCACATCGCA
    GGTGAGAGCCATCACAG
    GAGTGTGTAAGCTGTGCCACG
    TTTATTCCATGTCACGAGTGT
    TGTTATACACATCACTAGTG
    AAACGTGCTCCCACTCGCA
    TGTGATTCGATTCACA

Create a Multiple Sequence Alignment

    GGATAACAATTTCACA
    TGTGAGCGGATAACAA
    TGTGAGTTAGCTCACT
    TGTGATCTCTGTTACA
    CGAGGATGAGAACACA
    CTCGGTTTAGTTCACC
    TGTGACACAGTGCAAA
    CCTGACGGAGTTCACA
    AGTGTCTATAATCACG
    TGGAATATCCATCACA
    TGCAAAGGACGTCACG
    GGCGACCTGGGTCATG
    TGTGATGTGTATCGAA
    TTTGAACCACATCGCA
    GGTGAGAGCCATCACA
    TGTAAGCTGTGCCACG
    TTTATTCCATGTCACG
    TGTTATACACATCACT
    CGTGCTCCCACTCGCA
    TGTGATTCGATTCACA

Generate a PSSM

    XXXXXTGTGAXXXXAXTCACAXXXXXXX
    XXXXXACACTXXXXTXAGTGTXXXXXXX
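One way the PSSM could be generated from the 20 aligned CRP sites is sketched below. It assumes a uniform background of 0.25 and adds a pseudocount of 1 per nucleotide so that nucleotides never seen at a position do not produce log2(0). The pseudocount is an addition made for this example, not something the slides specify, and `build_pssm` is a hypothetical name.

```python
from collections import Counter
from math import log2

ALIGNED_SITES = [
    "GGATAACAATTTCACA", "TGTGAGCGGATAACAA", "TGTGAGTTAGCTCACT", "TGTGATCTCTGTTACA",
    "CGAGGATGAGAACACA", "CTCGGTTTAGTTCACC", "TGTGACACAGTGCAAA", "CCTGACGGAGTTCACA",
    "AGTGTCTATAATCACG", "TGGAATATCCATCACA", "TGCAAAGGACGTCACG", "GGCGACCTGGGTCATG",
    "TGTGATGTGTATCGAA", "TTTGAACCACATCGCA", "GGTGAGAGCCATCACA", "TGTAAGCTGTGCCACG",
    "TTTATTCCATGTCACG", "TGTTATACACATCACT", "CGTGCTCCCACTCGCA", "TGTGATTCGATTCACA",
]

def build_pssm(sites, background=0.25, pseudocount=1.0):
    """Log-odds matrix as a list of per-position dicts (nucleotide -> score)."""
    alphabet = "ACGT"
    pssm = []
    for column in zip(*sites):
        counts = Counter(column)
        # Assumption: pseudocount added to every nucleotide to avoid log2(0).
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            n: log2(((counts[n] + pseudocount) / total) / background)
            for n in alphabet
        })
    return pssm

pssm = build_pssm(ALIGNED_SITES)
# Highest-scoring nucleotide at each position, as a quick sanity check.
best = "".join(max(column, key=column.get) for column in pssm)
print(best)
```

The resulting matrix has 16 positions, one per column of the alignment, and can be scanned along a sequence window by window by summing the scores of the observed nucleotides.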

PROBLEMS…

- When searching for a motif in a genome using a PSSM or other methods, the motif is usually found all over the place. -> The motif is considered real if it is found in the vicinity of a gene.
- When checking experimentally for the binding sites of a specific TF (location analysis), the sites the TF binds are in some cases similar to the PSSM and sometimes not!

Computational Methods

This problem has received a lot of attention from computer scientists. Methods include:

- Probabilistic methods: hidden Markov models (HMMs), expectation maximization (EM), Gibbs sampling, etc.
- Enumeration methods: problematic for inexact motifs of length k > 10.
- …

Current status: the problem is still open.

Tools on the Web

- MEME: Multiple EM for Motif Elicitation. http://meme.sdsc.edu/meme/website/
- metaMEME: uses the HMM method. http://meme.sdsc.edu/meme
- MAST: Motif Alignment and Search Tool. http://meme.sdsc.edu/meme
- TRANSFAC: database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors. http://transfac.gbf.de/TRANSFAC/
- eMotif: allows scanning, making and searching for motifs at the protein level. http://motif.stanford.edu/emotif/
