Biological Sequence Pattern Analysis Liangjiang (LJ) Wang March 8, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 16.

1 Biological Sequence Pattern Analysis Liangjiang (LJ) Wang March 8, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 16

2 Outline Basic concepts and biological problems. Regular expression for: –Pattern matching (sequence motifs), –Pattern discovery (promoter elements). Position Weight Matrix (PWM) for: –Pattern matching (TransFac, TESS, etc), –Pattern discovery (MEME, Gibbs sampling). Hidden Markov Models (HMMs) for protein domain analysis (next lecture).

3 Biological Sequence Patterns In nucleotide sequences: –Transcription start and termination sites, –Promoter cis regulatory elements, –Intron/exon splice sites, –Translation start and stop sites, –mRNA cis regulatory elements. In protein sequences: –Functional motifs such as signal peptides, –Conserved protein domains.

4 Promoter cis Regulatory Elements Cells respond to various stimuli by regulating the expression of particular genes. Transcription factors regulate gene expression by binding to specific DNA sequence motifs. Transcription factor binding sites are often short (5 – 25 bases) and degenerate DNA motifs. Co-regulated genes may have common regulatory motifs in their promoters. (Figure: MyoD helix-loop-helix (HLH) dimer bound to the DNA site CAACTGAC.)

5 How to Represent a Sequence Pattern? Regular expressions: –A pattern is represented by a string of characters such as TATAAAA (the TATA box). –Ambiguous characters, wild-cards and gaps are allowed, but no position-specific information. Position Weight Matrices (PWM): –Also called Position-Specific Score Matrix (PSSM). –Often an ungapped pattern specified by a table. Stochastic models: –Hidden Markov Models (HMM), neural nets, etc. –Based on probability / machine learning theory.

6 Pattern Matching vs. Pattern Discovery Pattern matching: –Scanning a nucleotide or protein sequence for matches to a known pattern. –The major consideration is how to achieve better sensitivity and specificity. Pattern discovery: –Given a set of sequences, discovering a pattern that is shared by the sequences. The pattern is not known in advance. –Using search or learning approaches. –A much harder problem than pattern matching.

7 Pattern Matching with RegExp A regular expression (RegExp) can represent: –Ambiguous characters: e.g., [AG] or R. –Wild-cards: e.g., X for any amino acid. –Gaps: e.g., x(i, j) in PROSITE patterns. Pattern matching with regular expressions is straightforward and often very useful. For example, find all the Arabidopsis proteins that contain the following motif: [RK][LVI]X{5}[QH][LA] (these proteins may be targeted to the peroxisome). Tool: Patmatch at TAIR.
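The motif search on this slide can be sketched with Python's standard `re` module. This is a minimal illustration, not how Patmatch itself works: the helper names and the toy protein are hypothetical, and the only translation applied is the slide's `X` wild-card to the regex `.`.

```python
import re

def prosite_like_to_regex(pattern: str) -> str:
    """Translate the slide's notation to a Python regex: X (any residue) becomes '.'."""
    return pattern.replace("X", ".").replace("x", ".")

def find_motif(pattern: str, sequence: str):
    """Return (1-based start, matched text) for every match, including overlapping ones."""
    # The lookahead makes each match zero-width, so overlapping hits are all reported.
    regex = re.compile("(?=(" + prosite_like_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# Hypothetical toy protein, invented for illustration only.
toy_protein = "MSRLAAAAAHLGK"
hits = find_motif("[RK][LVI]X{5}[QH][LA]", toy_protein)
print(hits)  # [(3, 'RLAAAAAHL')]
```

In practice the same function would be applied to every record of a protein FASTA file, keeping the identifiers of the sequences with at least one hit.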

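8 Pattern Discovery Using RegExp Enumerate all the possible regular expression patterns with ambiguous characters, e.g., CWTNC, CRTGTW, YCGGAYRRAWG, ... over the alphabet {A, C, G, T, R, Y, S, W, M, K, V, H, D, B, N}. Count the occurrences of all the patterns in the input sequences (word counting). Compute statistical significance based on the background distribution, e.g., the z-score $z_s = (N_s - E[N_s]) / \sigma(N_s)$, where $N_s$ is the number of occurrences of pattern $s$ in the input sequences, and $E[N_s]$ and $\sigma(N_s)$ are its expected value and standard deviation under the background model. (The method works for simple patterns such as short nucleotide motifs, but not for long and/or complex patterns.)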
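The enumerate, count, and score steps can be sketched as follows. This is a simplified illustration under stated assumptions: the background is uniform and independent per base, overlapping occurrences are treated as independent (a binomial approximation), and the toy promoters are invented. Published tools use richer background models, so the z-scores here are only indicative.

```python
import itertools
import math
import re

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "M": "AC", "K": "GT", "V": "ACG",
         "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT"}

def iupac_to_regex(pattern):
    """Expand each IUPAC code into a regex character class."""
    return "".join("[" + IUPAC[c] + "]" for c in pattern)

def count_occurrences(pattern, sequences):
    """Count all (possibly overlapping) occurrences of the pattern."""
    regex = re.compile("(?=" + iupac_to_regex(pattern) + ")")
    return sum(len(regex.findall(s)) for s in sequences)

def z_score(pattern, sequences):
    """z = (observed - expected) / sd under a uniform i.i.d. background (assumption)."""
    p = math.prod(len(IUPAC[c]) / 4 for c in pattern)  # chance a random window matches
    windows = sum(max(len(s) - len(pattern) + 1, 0) for s in sequences)
    expected = windows * p
    sd = math.sqrt(windows * p * (1 - p))
    if sd == 0:
        return 0.0  # e.g. an all-N pattern matches everywhere and is uninformative
    return (count_occurrences(pattern, sequences) - expected) / sd

def enumerate_patterns(width):
    """Yield every pattern of the given width over the 15-letter IUPAC alphabet."""
    for letters in itertools.product(IUPAC, repeat=width):
        yield "".join(letters)

# Hypothetical toy promoters, invented for illustration only.
toy_promoters = ["ACGTACGT", "TTACGTTT", "GGACGTAA"]
print(count_occurrences("ACGT", toy_promoters))  # 4
best = max(enumerate_patterns(3), key=lambda pat: z_score(pat, toy_promoters))
print(best, round(z_score(best, toy_promoters), 2))
```

The search space grows as 15^W, which is why the slide restricts the method to short, simple patterns.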

9 Applications to Promoter Analysis The RegExp pattern enumeration method has been used to find cis regulatory motifs that are statistically overrepresented in a given promoter sequence dataset: –Sinha and Tompa, 2002. Discovery of novel transcription factor binding sites by statistical overrepresentation. NAR, 30:5549-5560. –Their program, YMF, is available online. –Complete search: all motifs in the search space are enumerated and tested for statistical overrepresentation.

10 Problems with RegExp Do not specify the relative frequencies of nucleotides at a position. Cannot express the relative importance of a position for the pattern. Cannot capture a possible relationship between two positions. Example: the aligned sites AGTCC, AGTAC, AGTGG, AACTT and AAGTT collapse to the consensus ARBNB.

11 PWM Representation of a Motif A motif is assumed to have a fixed width, W. In the PWM, $p_{nk}$ is the probability (relative frequency) of nucleotide n in column k. Background probability: $p_{n0}$ is the probability of n in the background (i.e., outside the motif). Equal distribution: $p_{A0} = p_{C0} = p_{G0} = p_{T0} = 1/4$. Have we lost information here?
Example sites: AGTCC, AGTAC, AGTGG, AACTT, AAGTT
PWM (rows: nucleotide; columns: motif position):
     1      2      3      4      5
A    1      0.25   0      0.25   0
C    0      0      0.125  0.25   0.5
G    0      0.75   0.125  0.25   0.25
T    0      0      0.75   0.25   0.25
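As a rough sketch of how a PWM is obtained from aligned sites, the function below counts nucleotides column by column and converts the counts to relative frequencies. The function name and the optional `pseudocount` parameter are assumptions made for illustration. Note that the five example sites alone give, for instance, a frequency of 0.6 for G in column 2, so the matrix printed on the slide was presumably computed from a larger alignment.

```python
from collections import Counter

def build_pwm(sites, alphabet="ACGT", pseudocount=0.0):
    """Return {nucleotide: [p_n1, ..., p_nW]} from equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same width")
    pwm = {n: [] for n in alphabet}
    total = len(sites) + pseudocount * len(alphabet)
    for k in range(width):
        counts = Counter(site[k] for site in sites)  # nucleotide counts in column k
        for n in alphabet:
            pwm[n].append((counts[n] + pseudocount) / total)
    return pwm

sites = ["AGTCC", "AGTAC", "AGTGG", "AACTT", "AAGTT"]
pwm = build_pwm(sites)
for n in "ACGT":
    print(n, pwm[n])
```

A small positive `pseudocount` avoids zero probabilities, which matters once the matrix is used for log-odds scoring.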

12 Visualization of PWM Patterns The pattern captured by a multiple sequence alignment (MSA) or PWM may be visualized using a sequence logo. Assuming equal background probability for A, C, G and T (1/4), the Information Content (IC) of the nucleotide PWM at position k is: $IC_k = 2 + \sum_{n \in \{A,C,G,T\}} p_{nk} \log_2 p_{nk}$ (in bits), where $p_{nk}$ is the probability of n at position k.

13 Information Content (IC) IC is a measure of a site's tolerance for substitution: high IC, low tolerance. If $p_{A1} = 1$, $p_{C1} = 0$, $p_{G1} = 0$, $p_{T1} = 0$, then $IC_1 = 2 + 1 \cdot \log_2 1 = 2$ bits (taking $0 \log_2 0 = 0$). If $p_{A4} = p_{C4} = p_{G4} = p_{T4} = 1/4$, then $IC_4 = 2 + 4 \cdot \tfrac{1}{4} \log_2 \tfrac{1}{4} = 0$ bits. (The slide repeats the PWM and example sites from slide 11.)
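The two cases on this slide can be checked numerically. The sketch below assumes the uniform background stated on the previous slide, so that IC = 2 + Σ p log2 p bits; the function name is hypothetical. The third column uses the values 0.25 (A) and 0.75 (G) shown for position 2 of the example matrix.

```python
import math

def information_content(column):
    """IC in bits for one PWM column, assuming a uniform background; 0*log2(0) is taken as 0."""
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)

print(information_content([1.0, 0.0, 0.0, 0.0]))      # 2.0 (fully conserved position)
print(information_content([0.25, 0.25, 0.25, 0.25]))  # 0.0 (no preference at all)
print(round(information_content([0.25, 0.0, 0.75, 0.0]), 3))  # 1.189
```

In a sequence logo, the total stack height at a position is this IC value, and each letter's height is its probability times the IC.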

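14 Pattern Matching with PWM Given a Position Weight Matrix (PWM) of a pattern, find all the occurrences of the pattern in the input sequence. Sliding window analysis: every window of width W in the sequence is scored against the PWM. How to score a match? For a window $c_1 c_2 \ldots c_W$: $\text{Score} = \prod_{k=1}^{W} p_{c_k k} / q_{c_k}$, where $p_{c_k k}$ is the PWM entry at position k corresponding to character $c_k$ of the sequence, and $q_{c_k}$ is the background probability of $c_k$. The log-odds score, $\sum_{k=1}^{W} \log_2 (p_{c_k k} / q_{c_k})$, is often used instead.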
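A sliding-window scan with log-odds scoring might look like the sketch below. Several things are assumptions made for illustration: the matrix values follow the example on slide 11 as far as they are legible, with the remaining entries chosen so that each column sums to 1; zero probabilities are floored at a small value to keep the logarithm finite; and the threshold and toy sequence are invented.

```python
import math

# Example PWM (nucleotide -> probability per motif position); partly reconstructed, see lead-in.
PWM = {
    "A": [1.0, 0.25, 0.0, 0.25, 0.0],
    "C": [0.0, 0.0, 0.125, 0.25, 0.5],
    "G": [0.0, 0.75, 0.125, 0.25, 0.25],
    "T": [0.0, 0.0, 0.75, 0.25, 0.25],
}
BACKGROUND = {n: 0.25 for n in "ACGT"}

def log_odds_score(window, pwm, background, floor=1e-6):
    """Sum over positions of log2(p_ck / q_c); zero entries are floored (an assumption)."""
    return sum(math.log2(max(pwm[c][k], floor) / background[c])
               for k, c in enumerate(window))

def scan(sequence, pwm, background, threshold=0.0):
    """Return (1-based start, window, score) for every window scoring at or above threshold."""
    width = len(next(iter(pwm.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = log_odds_score(window, pwm, background)
        if score >= threshold:
            hits.append((start + 1, window, score))
    return hits

# Hypothetical toy sequence, invented for illustration only.
for start, window, score in scan("CCAGTACGG", PWM, BACKGROUND):
    print(start, window, round(score, 2))  # 3 AGTAC 6.17
```

Raising the threshold trades sensitivity for specificity, which is the main tuning decision in PWM-based site prediction.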

15 Resources for Promoter Analysis TransFac: –A database on eukaryotic transcription factors (TFs) and their DNA binding sites (PWMs). –Provides TF classification and search options. TESS (Transcription Element Search System at ?RQ=WELCOME): –A web tool for predicting TF binding sites. –Uses PWMs from TransFac and others. SCPD: –The promoter database of Saccharomyces cerevisiae. –Tools for site prediction and promoter retrieval.

16 Pattern Discovery Using PWM The Problem: –Given a set of unaligned sequences, discover a PWM pattern shared by the sequences. –The pattern locations on the sequences are also unknown in advance. Two sets of parameters to estimate (or learn): –PWM of a potential pattern. –Pattern offset matrix. Algorithmic approaches: –Expectation Maximization. –Gibbs sampling. (Figure labels: Motif, Sequences.)

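17 Pattern Offset Matrix The element $Z_{ij}$ of the pattern offset matrix Z is the probability that the pattern (given in p) starts at position j of sequence i ($X_i$). The probability of a sequence $X_i$ of length L with the pattern starting at j is: $\Pr(X_i \mid Z_{ij} = 1, p) = \prod_{k=1}^{j-1} p_{c_k 0} \; \prod_{k=j}^{j+W-1} p_{c_k,\,k-j+1} \; \prod_{k=j+W}^{L} p_{c_k 0}$, where $c_k$ is the character at position k of $X_i$. The three products cover the positions before the motif, in the motif, and after the motif. (Figure: example offset matrix Z with rows X1 to X4 and columns 1 to 5; only the entries 0.1 and 0.2 are legible.)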

18 Expectation Maximization (EM)
Given: length W, sequence dataset
set initial values for p
do {
    re-estimate Z from p (E-step)
    re-estimate p from Z (M-step)
} until (change in p < ε)
return p, Z
(Figure: cycle between p and Z, with the E-step going from p to Z and the M-step going from Z to p.)
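The loop above can be made concrete with a small pure-Python sketch. It assumes exactly one motif occurrence per sequence, a uniform background, and a pseudocount of 1 in the M-step; the starting point is derived from a seed word (0.7 for the seed letter, 0.1 for the others). All function names, parameter values and the toy sequences are hypothetical choices for illustration.

```python
ALPHABET = "ACGT"

def sequence_likelihood(seq, j, p, bg):
    """Pr(X_i | motif starts at j): PWM columns inside the motif, background outside."""
    W, prob = len(p), 1.0
    for pos, c in enumerate(seq):
        prob *= p[pos - j][c] if j <= pos < j + W else bg[c]
    return prob

def e_step(seqs, p, bg):
    """Re-estimate Z from p: normalise the likelihoods over all start positions."""
    Z = []
    for seq in seqs:
        weights = [sequence_likelihood(seq, j, p, bg) for j in range(len(seq) - len(p) + 1)]
        total = sum(weights)
        Z.append([w / total for w in weights])
    return Z

def m_step(seqs, Z, W, pseudocount=1.0):
    """Re-estimate p from Z: expected nucleotide counts per motif column."""
    counts = [{n: pseudocount for n in ALPHABET} for _ in range(W)]
    for seq, z_row in zip(seqs, Z):
        for j, z in enumerate(z_row):
            for k in range(W):
                counts[k][seq[j + k]] += z
    return [{n: col[n] / sum(col.values()) for n in ALPHABET} for col in counts]

def seed_to_pwm(word, hit=0.7):
    """Starting point derived from a seed subsequence (an assumed initialisation)."""
    return [{n: hit if n == c else (1 - hit) / 3 for n in ALPHABET} for c in word]

def em(seqs, p_init, eps=1e-6, max_iter=200):
    bg = {n: 0.25 for n in ALPHABET}  # uniform background (assumption)
    p, W = p_init, len(p_init)
    for _ in range(max_iter):
        Z = e_step(seqs, p, bg)
        new_p = m_step(seqs, Z, W)
        change = max(abs(new_p[k][n] - p[k][n]) for k in range(W) for n in ALPHABET)
        p = new_p
        if change < eps:
            break
    return p, e_step(seqs, p, bg)

# Hypothetical toy data: TATAAT planted in GC-rich flanks.
toy_seqs = ["GCGCTATAATGCGC", "CCTATAATGGCGCG", "GGCGCGCTATAATC", "CGTATAATCGGCGC"]
p, Z = em(toy_seqs, seed_to_pwm("TATAAT"))
print([max(range(len(row)), key=row.__getitem__) for row in Z])  # [4, 2, 7, 2] (0-based starts)
```

For long sequences the products in `sequence_likelihood` underflow, so a practical version would work with log-probabilities or likelihood ratios.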

19 More about the EM Algorithm EM is a heuristic algorithm for discovering PWM motifs shared by a set of sequences. EM converges to a local maximum in the likelihood of the data given the model p: $\prod_i \Pr(X_i \mid p)$. EM usually converges in a small number of iterations. EM is sensitive to the starting point (i.e., the initial values in p).

20 MEME MEME (Multiple EM for Motif Elicitation) is widely used for motif discovery. MEME is based on the EM algorithm with several extensions. MEME is available online. Example: the dataset contains 30 yeast promoters from a co-regulated gene cluster. These genes are mostly involved in respiration, and are co-regulated in various stress conditions. What is the TF binding site in the shared motif?

21 The MEME Algorithm
MEME (dataset, W, NSITES, PASSES) {
    for i = 1 to PASSES {
        for each subsequence in dataset {
            run EM for 1 iteration with starting point derived from this subsequence
        }
        choose a motif model with the highest likelihood
        run EM to convergence from the starting point which generated that model
        print converged model of the shared motif
        erase appearances of the motif from the dataset
    }
}

22 MEME Enhancements to the Basic EM Approach Trying many starting points by using every distinct subsequence of length W in the dataset. Not assuming that there is exactly one motif occurrence in every sequence. Allowing multiple motifs to be learned.

23 Gibbs Sampling For motif discovery, Gibbs sampling can be viewed as a stochastic analog of EM: –In the EM algorithm, we maintain a distribution $Z_i$ over the possible motif starting positions for each sequence; –In the Gibbs sampling approach, we maintain a specific starting position for each sequence, but keep re-sampling the starting positions. Gibbs sampling may be less susceptible to local optima than EM.

24 A Gibbs Sampling Algorithm
Given: length W, sequence dataset
choose random motif positions for a
do {
    pick a sequence X_i
    estimate p using motif positions in a (all sequences but X_i) (update step)
    sample a new motif position a_i for X_i (sampling step)
} until (change in p < ε)
return p, a
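A minimal sketch of this sampler is shown below. It departs from the pseudocode in one labeled respect: it runs a fixed number of iterations instead of testing the change in p. It also assumes a uniform background, a pseudocount of 1, and sampling weights proportional to the motif-versus-background likelihood ratio of each window. Function names, parameters and the toy sequences are hypothetical.

```python
import random

ALPHABET = "ACGT"

def estimate_pwm(seqs, positions, W, skip=None, pseudocount=1.0):
    """Update step: column frequencies from the current sites, leaving out sequence `skip`."""
    counts = [{n: pseudocount for n in ALPHABET} for _ in range(W)]
    for i, (seq, a) in enumerate(zip(seqs, positions)):
        if i == skip:
            continue
        for k in range(W):
            counts[k][seq[a + k]] += 1
    return [{n: col[n] / sum(col.values()) for n in ALPHABET} for col in counts]

def gibbs(seqs, W, n_iter=500, seed=0):
    rng = random.Random(seed)          # seeded for reproducibility
    bg = {n: 0.25 for n in ALPHABET}   # uniform background (assumption)
    a = [rng.randrange(len(s) - W + 1) for s in seqs]  # random initial motif positions
    for _ in range(n_iter):
        i = rng.randrange(len(seqs))   # pick a sequence X_i
        p = estimate_pwm(seqs, a, W, skip=i)
        seq = seqs[i]
        starts = range(len(seq) - W + 1)
        weights = []
        for j in starts:
            ratio = 1.0
            for k in range(W):
                c = seq[j + k]
                ratio *= p[k][c] / bg[c]  # motif vs. background likelihood ratio
            weights.append(ratio)
        # Sampling step: draw a new start in proportion to the weights.
        a[i] = rng.choices(starts, weights=weights)[0]
    return estimate_pwm(seqs, a, W), a

# Hypothetical toy data: TATAAT planted in GC-rich flanks.
toy_seqs = ["GCGCTATAATGCGC", "CCTATAATGGCGCG", "GGCGCGCTATAATC", "CGTATAATCGGCGC"]
p, a = gibbs(toy_seqs, W=6)
print(a)
print("".join(max(col, key=col.get) for col in p))
```

Because positions are sampled rather than chosen greedily, different seeds can give different final alignments; running several chains and comparing the resulting motifs is a common safeguard.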

25 Gibbs Motif Sampler and AlignACE Gibbs Motif Sampler: –Based on the work by Lawrence et al., 1993. Science, 262:208-214. –Available online. AlignACE: –Based on the Gibbs sampling algorithm with several extensions. –Available online.

26 Summary For simple sequence patterns, regular expression is a useful tool. For some complex sequence patterns, position weight matrix (PWM) is preferred. Expectation Maximization (EM) and Gibbs sampling are two useful approaches for sequence pattern discovery. Next: protein domain analysis using HMM

27 Reading (Optional) Lawrence et al., 1993. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208-214. Eddy, 2004. What is a hidden Markov model? Nature Biotechnology, 22:1315-1316. Eddy, 1998. Multiple alignment and multiple sequence based searches. Trends Guide to Bioinformatics, 15-18.

28 For This Week’s Lab Collect a set of promoter sequences (10-500 sequences in FASTA format) from co-regulated or related genes. The promoter sequences should be the 500-1500 nucleotides upstream of the transcription start sites. Collect a set of protein sequences (10-50 sequences in FASTA format) from a gene family or superfamily.
