From Pairwise to Multiple Alignment. WHAT'S TODAY? Multiple Sequence Alignment (CLUSTAL); MOTIF search.

Presentation transcript:

From Pairwise to Multiple Alignment

WHAT'S TODAY? Multiple Sequence Alignment (CLUSTAL); MOTIF search

Multiple Sequence Alignment (MSA)

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
Like pairwise alignment BUT compare n sequences instead of 2
Rows represent individual sequences
Columns represent ‘same’ position
Gaps allowed in all sequences

How to find the best MSA
With gaps:
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
4*1 + 11*0.75 + 2*0.5, Score = 13.25
Without gaps:
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC
1*1 + 2*0.75 + 11*0.5, Score = 8
Score per column: 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0
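The column scoring on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the lecturer's code: the function names are made up for illustration, and it assumes that gaps are ignored when counting the most common residue and that unaligned sequences are compared only over the length of the shortest one. Under those assumptions it reproduces both scores shown on the slide.

```python
from collections import Counter

def column_score(column):
    """Fraction of rows that share the most common residue in one column.

    Assumptions: gap characters are ignored, and a residue seen only
    once scores 0 (the slide's "1/4 = 0" rule).
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    top = Counter(residues).most_common(1)[0][1]
    return top / len(column) if top >= 2 else 0.0

def msa_score(rows):
    # zip() stops at the shortest row, so ungapped sequences of different
    # lengths are compared over their common length only.
    return sum(column_score(col) for col in zip(*rows))

unaligned = ["GTCGTAGTCGGCTCGAC", "GTCTAGCGAGCGTGAT",
             "GCGAAGAGGCGAGC", "GCCGTCGCGTCGTAAC"]
aligned = ["GTCGTAGTCG-GC-TCGAC", "GTC-TAG-CGAGCGT-GAT",
           "GC-GAAG-AG-GCG-AG-C", "GCCGTCG-CG-TCGTA-AC"]

print(msa_score(unaligned))  # 8.0
print(msa_score(aligned))    # 13.25
```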

Alignment of 3 sequences:
Complexity: length A × length B × length C
Aligning 100 proteins, 1000 amino acids each
Complexity: 1000^100 = 10^300 table cells
Calculation time: beyond the big bang!

Feasible Approach
Based on pairwise alignment scores
–Build n by n table of pairwise scores
Align similar sequences first
–After alignment, consider as single sequence
–Continue aligning with further sequences
Progressive alignment (Feng & Doolittle).

–For n sequences, there are n × (n-1)/2 pairs
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
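The first step of the progressive approach, building the table of pairwise scores and picking the most similar pair, could look roughly like this. It is a sketch under stated assumptions: the scoring values (match +1, mismatch -1, gap -1) and the function names are illustrative choices, and a plain Needleman-Wunsch score stands in for whatever pairwise measure a real tool uses.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def pairwise_table(seqs):
    """Score every unordered pair: n * (n - 1) / 2 entries for n sequences."""
    return {(i, j): nw_score(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}

seqs = ["GTCGTAGTCGGCTCGAC", "GTCTAGCGAGCGTGAT",
        "GCGAAGAGGCGAGC", "GCCGTCGCGTCGTAAC"]
table = pairwise_table(seqs)
first_pair = max(table, key=table.get)  # the pair to align first
print(len(table), first_pair)
```

A full progressive aligner would then merge that pair into a profile and keep adding the remaining sequences; that part is left out here.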

1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT
3 GC-GAAGAGGCG-AGC
4 GCCGTCGCGTCGTAAC

1 GTCGTA-GTCG-GC-TCGAC
2 GTC-TA-G-CGAGCGT-GAT
3 G-C-GAAGA-G-GCG-AG-C
4 G-CCGTCGC-G-TCGTAA-C

CLUSTAL method
Higgins and Sharp 1988
–ref: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244.
An approximation strategy (heuristic algorithm) yields a possible alignment, but not necessarily the best one
Applies Progressive Sequence Alignment

Treating Gaps in CLUSTAL
Penalty for opening gaps and additional penalty for extending the gap
Gaps found in initial alignment remain fixed
New gaps are introduced as more sequences are added (decreased penalty if gap exists)
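The open-plus-extend idea can be written as a small function. This is only a sketch: the default penalty values and the convention that the opening penalty covers the first gap position are assumptions for illustration, not CLUSTAL's actual parameters.

```python
def gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Affine gap cost: pay once to open, then a smaller amount per extra position.

    The default values and the "open covers the first position" convention
    are illustrative assumptions.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)

# One long gap is cheaper than several short ones of the same total length.
print(gap_penalty(3))        # 11.0
print(3 * gap_penalty(1))    # 30.0
```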

Other MSA Approaches
Progressive approach: CLUSTALW (CLUSTALX), PILEUP, T-COFFEE
Iterative approach: repeatedly realign subsets of sequences. MultAlin, DiAlign.
Statistical methods: Hidden Markov Models (only for proteins). SAM2K, MUSCLE
Genetic algorithm: SAGA

Links to commonly used MSA tools: CLUSTALW, T-COFFEE, MUSCLE, MAFFT, Kalign

CAUTION !!! Different tools may give different results

Example: 7 different alignment tools produced 6 different estimated evolutionary trees (Wong et al., Science 319, January 2008)

Motifs
Motifs represent a short common sequence
–Regulatory motifs (TF binding sites)
–Functional site in proteins (DNA binding motif)

DNA Regulatory Motifs
Transcription factors (TFs) bind to regulatory motifs
–TF binding motifs are usually 6–20 nucleotides long
–Usually located near the target gene, mostly upstream of the transcription start site
[Figure: Gene X, its transcription start site, and the SBF and MCM1 motifs bound by SBF and MCM1]

Why are motifs interesting?
A motif is evidence of binding
A motif can help in developing hypotheses regarding which protein is regulating the expression of a specific gene
Mutations at particular regulatory sites can lead to disease

Challenges How to recognize a regulatory motif? Can we identify new occurrences of known motifs in genome sequences? Can we discover new motifs within upstream sequences of genes?

E. coli promoter sequences

1. Motif Representation
Exact motif: CGGATATA
Consensus: represent only deterministic nucleotides.
–Example: HAP1 binding sites in 5 sequences.
CGGATATACCGG
CGGTGATAGCGG
CGGTACTAACGG
CGGCGGTAACGG
CGGCCCTAACGG
consensus motif: CGGNNNTANCGG
N stands for any nucleotide.
Representing only consensus loses information. How can this be avoided?
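The consensus rule used here (keep a nucleotide only where all sites agree, otherwise write N) is easy to check in code. This is a minimal sketch with an illustrative function name; it assumes the sites are already aligned and of equal length.

```python
def consensus(sites):
    """Keep a base where every aligned site agrees, otherwise emit 'N'."""
    return "".join(col[0] if len(set(col)) == 1 else "N"
                   for col in zip(*sites))

hap1_sites = ["CGGATATACCGG", "CGGTGATAGCGG", "CGGTACTAACGG",
              "CGGCGGTAACGG", "CGGCCCTAACGG"]
print(consensus(hap1_sites))  # CGGNNNTANCGG
```

Position 9 shows the information loss the slide mentions: three of the five sites have A there, but the consensus records only N.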

Representing the motif as a profile
TTGACA (-35), TATAAT (-10), transcription start site
[Figure: per-position A/T/G/C profiles of the two sites]
Based on ~450 known promoters

PSPM – Position Specific Probability Matrix
Represents a motif of length k (here k = 5)
Count the number of occurrences of each nucleotide in each position
[Table: rows A, C, T, G; columns 1-5]

PSPM – Position Specific Probability Matrix
Defines P_i over {A,C,G,T} for i = 1,…,k.
–P_i(A): frequency of nucleotide A in position i.
[Table: rows A, C, T, G; columns 1-5]

Identification of Known Motifs within Genomic Sequences
Motivation:
–Identification of new genes controlled by the same TF.
–Infer the function of these genes.
–Enable better understanding of the regulation mechanism.

PSPM – Position Specific Probability Matrix
Each k-mer is assigned a probability.
–Example: P(TCCAG) = 0.5*0.25*0.8*0.7*0.2
[Table: rows A, C, T, G; columns 1-5]

Detecting a Known Motif within a Sequence using PSPM
The PSPM is moved along the query sequence.
At each position the sub-sequence is scored for a match to the PSPM.
Example: sequence = ATGCAAGTCT…
Position 1: ATGCA 0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4
Position 2: TGCAA 0.5*0.25*0.8*0.7*0.6 = 4.2*10^-2
[Table: PSPM with rows A, C, T, G; columns 1-5]
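Sliding the PSPM along a sequence can be sketched as follows. The matrix values did not survive in this transcript, so the matrix below is a hypothetical one: the entries used in the slide's worked products (for example P_1(A) = 0.1, P_3(C) = 0.8, P_5(A) = 0.6) are taken from the examples, and the remaining entries are filled in as assumptions so that each column sums to 1. The function names are also illustrative.

```python
from math import prod

# Hypothetical PSPM of length 5. Entries that appear in the slide's worked
# examples are reused; all other entries are assumed for illustration.
PSPM = {
    "A": [0.10, 0.25, 0.05, 0.70, 0.60],
    "C": [0.30, 0.25, 0.80, 0.10, 0.10],
    "G": [0.10, 0.25, 0.10, 0.10, 0.20],
    "T": [0.50, 0.25, 0.05, 0.10, 0.10],
}
K = 5

def kmer_probability(kmer):
    """Product of the per-position probabilities of each base in the k-mer."""
    return prod(PSPM[base][i] for i, base in enumerate(kmer))

def scan(sequence):
    """Probability of every length-K window, in order of start position."""
    return [kmer_probability(sequence[i:i + K])
            for i in range(len(sequence) - K + 1)]

probs = scan("ATGCAAGTCT")
print(probs[0], probs[1])  # about 1.5e-4 and 0.042
```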

Detecting a Known Motif within a Sequence using PSSM
Is it a random match, or is it indeed an occurrence of the motif?
PSPM -> PSSM (Position Specific Scoring Matrix)
–odds score: O_i(n), where n ∈ {A,C,G,T}, for i = 1,…,k
–defined as O_i(n) = P_i(n)/P(n), where P(n) is the background frequency.
O_i(n) increases => higher odds that n at position i is part of a real motif.

PSSM as Odds Score Matrix
Assumption: the background frequency of each nucleotide is 0.25.
1. Original PSPM (P_i)
2. Odds Matrix (O_i)
3. Going to log scale we get an additive score, Log odds Matrix (log2 O_i)
[Tables: the three matrices, columns 1-5]

Calculating using Log Odds Matrix
Log-odds score ≈ 0 implies random match; score > 0 implies real match (?).
Example: sequence = ATGCAAGTCT…
Position 1: ATGCA, score = -2.7, odds = 2^-2.7 = 0.15
Position 2: TGCAA, score = 5.42, odds = 2^5.42 = 42.8
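The conversion from probabilities to additive log-odds scores can be sketched like this. As before, the PSPM is hypothetical: only the entries used in the slide's worked examples come from the source, the rest are assumed, and a uniform background of 0.25 is assumed. With these values the two example windows score close to the slide's -2.7 and 5.42.

```python
from math import log2

BACKGROUND = 0.25  # assumed uniform background frequency

# Hypothetical PSPM; entries not used in the slide's examples are assumed.
PSPM = {
    "A": [0.10, 0.25, 0.05, 0.70, 0.60],
    "C": [0.30, 0.25, 0.80, 0.10, 0.10],
    "G": [0.10, 0.25, 0.10, 0.10, 0.20],
    "T": [0.50, 0.25, 0.05, 0.10, 0.10],
}

# Log-odds matrix: log2(P_i(n) / P(n)) for each base n and position i.
PSSM = {base: [log2(p / BACKGROUND) for p in row] for base, row in PSPM.items()}

def log_odds_score(kmer):
    """Sum of per-position log-odds; adding logs replaces multiplying odds."""
    return sum(PSSM[base][i] for i, base in enumerate(kmer))

for window in ("ATGCA", "TGCAA"):
    score = log_odds_score(window)
    print(window, round(score, 2), round(2 ** score, 2))
```

Note that a zero probability in the PSPM would make `log2` fail, which is one reason real tools add pseudocounts before taking logs.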

Calculating the probability of a match
ATGCAAG
Position 1: ATGCA = 0.15
Position 2: TGCAA = 42.3
Position 3: GCAAG = 0.18
P(i) = S_i / ∑ S
Example: P(1) = 0.15 / (0.15 + 42.3 + 0.18) ≈ 0.0035
P(1) ≈ 0.0035, P(2) ≈ 0.992, P(3) ≈ 0.004
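The normalization step is just each window's odds score divided by the sum over all windows. A minimal sketch using the three odds values quoted on the slide (the variable names are illustrative):

```python
# Odds scores of the three windows of ATGCAAG, as quoted on the slide.
scores = {"ATGCA": 0.15, "TGCAA": 42.3, "GCAAG": 0.18}

total = sum(scores.values())
match_probability = {window: s / total for window, s in scores.items()}

for window, p in match_probability.items():
    print(window, round(p, 4))

best = max(match_probability, key=match_probability.get)
print("most likely motif occurrence:", best)
```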

Building a PSSM for short motifs
Collect all known sequences that bind a certain TF.
Align all sequences (using multiple sequence alignment).
Compute the frequency of each nucleotide in each position (PSPM).
Incorporate background frequency for each nucleotide (PSSM).
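The last two steps, from aligned sites to a PSPM and then to a log-odds PSSM, can be sketched as below. The HAP1 sites from the earlier slide serve as example input. The function names, the uniform background of 0.25, and the pseudocount (added so that no frequency is zero before taking logs) are assumptions for illustration, not part of the lecture.

```python
from collections import Counter
from math import log2

def build_pspm(sites, pseudocount=0.0):
    """Per-position nucleotide frequencies from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pspm = []
    for column in zip(*sites):
        counts = Counter(column)
        pspm.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return pspm

def build_pssm(pspm, background=0.25):
    """Log-odds of each frequency against an assumed uniform background."""
    return [{b: log2(p / background) for b, p in col.items()} for col in pspm]

hap1_sites = ["CGGATATACCGG", "CGGTGATAGCGG", "CGGTACTAACGG",
              "CGGCGGTAACGG", "CGGCCCTAACGG"]

frequencies = build_pspm(hap1_sites)              # plain frequencies
pssm = build_pssm(build_pspm(hap1_sites, 0.5))    # pseudocount avoids log2(0)
print(frequencies[8])   # position 9: A is most frequent but not unanimous
print(pssm[0])          # position 1: C scores positive, the rest negative
```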

Graphical Representation – Sequence Logo
Horizontal axis: position of the base in the sequence.
Vertical axis: amount of information (bits).
Letter stack: order indicates importance.
Letter height: indicates frequency.
Consensus can be read across the top of the letter columns.
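The "amount of information" on the vertical axis is commonly computed per position as 2 bits minus the Shannon entropy of that position's nucleotide frequencies, with each letter drawn at a height proportional to its frequency. The sketch below follows that common convention; it omits the small-sample correction that logo tools may apply, and the function names are illustrative.

```python
from math import log2

def information_content(column):
    """Bits of information at one DNA position: 2 - Shannon entropy."""
    entropy = -sum(p * log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy

def letter_heights(column):
    """Height of each letter: its frequency times the column's information."""
    ic = information_content(column)
    return {base: p * ic for base, p in column.items()}

conserved = {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}
mixed = {"A": 0.6, "C": 0.2, "G": 0.2, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

for name, col in (("conserved", conserved), ("mixed", mixed), ("uniform", uniform)):
    print(name, round(information_content(col), 3), letter_heights(col))
```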

WebLogo - Input

WebLogo - Output (genes and proteins)

Finding new Motifs We are given a group of genes, which presumably contain a common regulatory motif. We know nothing of the TF that binds to the putative motif. The problem: discover the motif.

Motif Discovery

Example: Predicting the cAMP Receptor Protein (CRP) binding site motif

Extract experimentally defined CRP Binding Sites
GGATAACAATTTCACA
AGTGTGTGAGCGGATAACAA
AAGGTGTGAGTTAGCTCACTCCCC
TGTGATCTCTGTTACATAG
ACGTGCGAGGATGAGAACACA
ATGTGTGTGCTCGGTTTAGTTCACC
TGTGACACAGTGCAAACGCG
CCTGACGGAGTTCACA
AATTGTGAGTGTCTATAATCACG
ATCGATTTGGAATATCCATCACA
TGCAAAGGACGTCACGATTTGGG
AGCTGGCGACCTGGGTCATG
TGTGATGTGTATCGAACCGTGT
ATTTATTTGAACCACATCGCA
GGTGAGAGCCATCACAG
GAGTGTGTAAGCTGTGCCACG
TTTATTCCATGTCACGAGTGT
TGTTATACACATCACTAGTG
AAACGTGCTCCCACTCGCA
TGTGATTCGATTCACA

Create a Multiple Sequence Alignment
GGATAACAATTTCACA
TGTGAGCGGATAACAA
TGTGAGTTAGCTCACT
TGTGATCTCTGTTACA
CGAGGATGAGAACACA
CTCGGTTTAGTTCACC
TGTGACACAGTGCAAA
CCTGACGGAGTTCACA
AGTGTCTATAATCACG
TGGAATATCCATCACA
TGCAAAGGACGTCACG
GGCGACCTGGGTCATG
TGTGATGTGTATCGAA
TTTGAACCACATCGCA
GGTGAGAGCCATCACA
TGTAAGCTGTGCCACG
TTTATTCCATGTCACG
TGTTATACACATCACT
CGTGCTCCCACTCGCA
TGTGATTCGATTCACA

Generate a PSSM
XXXXXTGTGAXXXXAXTCACAXXXXXXX
XXXXXACACTXXXXTXAGTGTXXXXXXX

PROBLEMS…
When searching for a motif in a genome using a PSSM or other methods, the motif is usually found all over the place.
-> The motif is considered real if found in the vicinity of a gene.
When checking experimentally for the binding sites of a specific TF (location analysis), the sites that bind the TF are in some cases similar to the PSSM and sometimes not!

Computational Methods
This problem has received a lot of attention from CS people.
Methods include:
–Probabilistic methods: hidden Markov models (HMMs), expectation maximization (EM), Gibbs sampling, etc.
–Enumeration methods: problematic for inexact motifs of length k>10.
…
Current status: Problem is still open.

Tools on the Web
MEME: Multiple EM for Motif Elicitation.
metaMEME: uses the HMM method.
MAST: Motif Alignment and Search Tool.
TRANSFAC: database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors.
eMotif: allows scanning, making and searching for motifs at the protein level.