1 Transcription factor binding motifs
Prof. William Stafford Noble, GENOME 541

2 Outline
Representing motifs
Motif discovery: Gibbs sampling, MEME
Scanning for motif occurrences
Multiple testing correction redux

3 Motif (n): a succession of notes that has some special importance in or is characteristic of a composition

4 Motif (n): a recurring genomic sequence pattern
TCCACGGC

5 Sequence-specific transcription factors drive gene regulation

6 Motif discovery problem
Given sequences (seq. 1, seq. 2, seq. 3), find the motif: IGRGGFGEVY at position 515 in seq. 1, LGEGCFGQVV at position 430 in seq. 2, and VGSGGFGQVY at position 682 in seq. 3.

7 Motif discovery problem (harder version)
Given: a sequence or family of sequences. Find: the number of motifs, the width of each motif, and the locations of motif occurrences.

8 Why is this hard? Input sequences are long (thousands or millions of residues), and the motif may be subtle: instances are short and only slightly similar to one another.

9 The most common model of sequence motifs is the position-specific scoring matrix
(PSSM figure: one row per letter A, C, G, T; one column of scores per motif position.)

10 Log-odds score
The amino acid "A" is observed: was it generated by the foreground model (i.e., the PSSM) or by the background model (i.e., randomly selected)? Estimate the probability of observing each amino acid under the foreground model, divide by the background probability of observing the same amino acid, and take the log so that the scores are additive.
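As a minimal sketch (the function name and the example probabilities are mine, invented for illustration), the per-letter computation looks like this:

```python
import math

def log_odds(p_foreground, p_background):
    """Log-odds score for one observed letter: log (base 2) of its
    probability under the foreground model (the PSSM column) divided
    by its probability under the background model."""
    return math.log2(p_foreground / p_background)

# Hypothetical example: 'A' has probability 0.45 in this motif column
# but background frequency 0.25, so observing it supports the motif.
print(log_odds(0.45, 0.25))  # ~0.85 bits
```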

11 Motif logos scale letters by relative entropy
(Figure: splice site motif and CTCF binding motif logos.) Let p_i,α = the probability of letter α in matrix position i, and b_α = the background frequency of α. The total height of position i is its relative entropy, Σ_α p_i,α log2(p_i,α / b_α).
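A minimal sketch of the column-height computation, assuming p and b are probability dictionaries (the example column is made up):

```python
import math

def column_height(p, b):
    """Relative entropy (in bits) of one motif column p against
    background b: sum over letters of p[a] * log2(p[a] / b[a]).
    Letters within the column get heights proportional to p[a]."""
    return sum(p[a] * math.log2(p[a] / b[a]) for a in p if p[a] > 0)

p = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}      # hypothetical column
b = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # uniform background
print(column_height(p, b))  # total height of this logo column, ~0.64 bits
```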

12 Gibbs sampling Lawrence et al. “Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment.” Science 1993

13 Alternating approach
1. Guess an initial weight matrix.
2. Use the weight matrix to predict instances in the input sequences.
3. Use the instances to predict a PSSM.
4. Repeat steps 2 & 3 until satisfied.

14 Initialization
Randomly guess an instance si from each of t input sequences {S1, ..., St}: sequence 1: ACAGTGT; sequence 2: TTAGACC; sequence 3: GTGACCA; sequence 4: ACCCAGG; sequence 5: CAGGTTT.

15 Gibbs sampler
Initialization: randomly guess an instance si from each of t input sequences {S1, ..., St}.
Steps 2 & 3 (search):
1. Throw away an instance si; the remaining (t - 1) instances define a PSSM.
2. The PSSM defines an instance probability at each position of input sequence Si.
3. Pick a new si according to that probability distribution.
Return the highest-scoring motif seen.
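A minimal sketch of this loop, assuming a fixed motif width and a uniform 0.25 background, and omitting refinements from Lawrence et al. such as phase shifts (all names here are mine):

```python
import random

ALPHABET = "ACGT"

def build_pssm(instances, pseudocount=1.0):
    """Per-column letter probabilities estimated from the current instances."""
    w = len(instances[0])
    pssm = []
    for k in range(w):
        counts = {a: pseudocount for a in ALPHABET}
        for inst in instances:
            counts[inst[k]] += 1
        total = sum(counts.values())
        pssm.append({a: c / total for a, c in counts.items()})
    return pssm

def gibbs_sample(seqs, w, n_iters=1000, seed=0):
    """Steps 1-3 from the slide: hold one sequence out, build a PSSM
    from the other instances, and resample the held-out instance."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]   # step 1
    for _ in range(n_iters):
        i = rng.randrange(len(seqs))                         # step 2: hold out s_i
        others = [s[p:p + w] for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
        pssm = build_pssm(others)
        weights = []                                         # PSSM vs. background odds
        for p in range(len(seqs[i]) - w + 1):
            odds = 1.0
            for k, a in enumerate(seqs[i][p:p + w]):
                odds *= pssm[k][a] / 0.25
            weights.append(odds)
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]  # step 3
    # A full sampler would also track and return the highest-scoring motif seen.
    return [s[p:p + w] for s, p in zip(seqs, starts)]
```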

16 Sampler step illustration
Hold out sequence 4: the instances ACAGTGT, TAGGCGT, ACACCGT, ???????, CAGGTTT from the other sequences define a PSSM (rows A, C, G, T, with entries such as .45, .05, .25, .65, .85). Candidate instances in sequence 4 are assigned probabilities (e.g. 11%; ACGCCGT: 20%; ACGGCGT: 52%), one is sampled (here ACGCCGT), and the instance set becomes ACAGTGT, TAGGCGT, ACACCGT, ACGCCGT, CAGGTTT.

17 MEME Bailey and Elkan. “Fitting a mixture model by expectation-maximization to discover motifs in biopolymers.” ISMB 1994.

18 MEME problem statement
Input: a collection of sequences. Assumption: one TFBS per sequence. Output: a high-likelihood PSSM.

19 MEME likelihood model
(Figure: graphical model connecting each TFBS starti to the corresponding sequencei.)

20 The EM algorithm optimizes latent variable (missing data) models
(Figure: an observed sequence C A T G G with an unobserved TFBS start position.)

21 The EM algorithm optimizes latent variable (missing data) models
(Figure: observed data X, latent variable Y, parameters θ.)

22 The EM algorithm optimizes latent variable (missing data) models
Want to optimize: log Pr(X | θ) = log Σ_Y Pr(X, Y | θ).

23 The EM algorithm optimizes latent variable (missing data) models
Want to optimize: log Pr(X | θ). Easier to optimize: the expected complete-data log likelihood. E-step: compute Q(θ | θ^t) = E_{Y|X,θ^t}[log Pr(X, Y | θ)]. M-step: set θ^{t+1} = argmax_θ Q(θ | θ^t).

24 Idea #1: Compute probability distribution jointly over all sites
E-step: compute the joint distribution over all TFBS starts (Y) given the sequence (X) and the current parameters θ.

25 In order to efficiently compute a probability distribution over many variables, that distribution must be factorizable. Independent: Pr(Y) = Π_i Pr(Yi). Chain-structured: Pr(Y) = Pr(Y1) Π_i Pr(Yi | Yi-1). Tree-structured: Pr(Y) = Π_i Pr(Yi | Y_parent(i)).

26 The MEME probability distribution is efficiently computable because each sequence is independent given θ
(Figure: θ with one independent Yi, Xi pair per sequence, where Yi is the TFBS start and Xi the sequence.) E-step, distribution for sequence i: Pr(Yi = j | Xi, θ) ∝ Pr(Xi | Yi = j, θ).

27 MEME update rule
M-step update rule: the new probability of letter a at motif position k is proportional to its expected count, Σ_i Σ_j Pr(Yi = j | Xi, θ) · I(the instance starting at position j of sequence i has letter a at motif position k), where i indexes sequences, j indexes start positions, and I is the indicator function.
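A sketch of one EM iteration under the one-TFBS-per-sequence assumption from the problem statement, with a uniform prior over start positions and a uniform 0.25 background (the names and pseudocount are mine; MEME itself supports richer models and priors):

```python
ALPHABET = "ACGT"

def em_iteration(seqs, pssm, w, background=0.25):
    """One E-step + M-step for a width-w motif; pssm is a list of
    {letter: probability} dicts, one per motif position."""
    # E-step: Pr(Y_i = j | X_i, theta) for every start j in sequence i.
    posteriors = []
    for s in seqs:
        likes = []
        for j in range(len(s) - w + 1):
            like = 1.0
            for k in range(w):
                like *= pssm[k][s[j + k]] / background
            likes.append(like)
        total = sum(likes)
        posteriors.append([x / total for x in likes])
    # M-step: expected letter counts, i.e. posterior-weighted indicators
    # summed over sequences i and start positions j.
    new_pssm = [{a: 0.01 for a in ALPHABET} for _ in range(w)]  # tiny pseudocount
    for s, post in zip(seqs, posteriors):
        for j, pr in enumerate(post):
            for k in range(w):
                new_pssm[k][s[j + k]] += pr
    for col in new_pssm:                 # renormalize each column
        z = sum(col.values())
        for a in col:
            col[a] /= z
    return new_pssm
```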

28 Alternating approach
1. Guess an initial weight matrix.
2. Use the weight matrix to predict instances in the input sequences.
3. Use the instances to predict a weight matrix.
4. Repeat steps 2 & 3 until satisfied.

29 MEME initializes EM separately using each subsequence as a starting point
Build an initial PSSM (rows A, C, G, T) from each subsequence of the input and run one round of EM. Choose the highest-likelihood initializations and run EM to convergence.

30 Scanning for motifs Grant et al. “FIMO: Scanning for occurrences of a given motif.” Bioinformatics 2011.

31 CTCF One of the most important transcription factors in human cells.
Responsible both for turning genes on and for maintaining the 3D structure of DNA.

32 Motivating question: How accurately does a PSSM predict the binding of a given transcription factor?

33 Scanning for motif occurrences
Given: a long DNA sequence (TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG) and a DNA motif represented as a PSSM (rows A, C, G, T). Find: occurrences of the motif in the sequence.

34 Scanning for motif occurrences
Summing the PSSM entries for the first window of TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG gives a score of 6.87.

35 Scanning for motif occurrences
Sliding the PSSM one position over, the next window scores -1.39.
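A sketch of the scan itself (forward strand only; the PSSM is assumed to be a list of per-position {letter: log-odds} dicts, and the function name is mine rather than FIMO's API):

```python
def scan(sequence, pssm, threshold):
    """Score every window by summing the PSSM's log-odds entries,
    and report windows scoring at or above the threshold."""
    w = len(pssm)
    hits = []
    for i in range(len(sequence) - w + 1):
        score = sum(pssm[k][sequence[i + k]] for k in range(w))
        if score >= threshold:
            hits.append((i, score, sequence[i:i + w]))
    return hits
```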

36 Searching human chromosome 21 with the CTCF motif

37 Significance of scores
The motif scanning algorithm assigns each site a score (e.g., TTGACCAGCAGGGGGCGCCG scores 26.30). A low score means the site is not a motif occurrence; a high score means it is. How high is high enough?

38 Two ways to assess significance
Empirical: randomly generate data according to the null hypothesis, and use the resulting score distribution to estimate p-values. Exact: mathematically calculate the distribution over all possible scores.
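A sketch of the empirical approach for a single PSSM score, assuming a uniform background (the exact approach is developed over the next slides):

```python
import random

def empirical_pvalue(observed, pssm, n=100_000, seed=0):
    """Score n random width-w sequences drawn from the background and
    report the fraction scoring at least as high as the observed score.
    Precision is limited to roughly 1/n, hence the poor tail behavior."""
    rng = random.Random(seed)
    w = len(pssm)
    exceed = 0
    for _ in range(n):
        s = "".join(rng.choice("ACGT") for _ in range(w))
        if sum(pssm[k][s[k]] for k in range(w)) >= observed:
            exceed += 1
    return (exceed + 1) / (n + 1)  # add-one smoothing avoids p = 0
```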

39 CTCF empirical null distribution

40 Poor precision in the tail

41 Converting scores to p-values
Linearly rescale the matrix values to the range [0,100] and integerize.

42 Converting scores to p-values
Find the smallest value. Subtract that value from every entry in the matrix. All entries are now non-negative.

43 Converting scores to p-values
Find the largest value (here, 7). Divide 100 by that value (100 / 7 ≈ 14.3). Multiply every entry by the result. All entries are now between 0 and 100.

44 Converting scores to p-values
Round to the nearest integer.
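The three rescaling steps above fit in one small sketch (my function; the input is assumed to be a list of per-position {letter: score} dicts):

```python
def integerize(pssm):
    """Shift so the smallest entry is 0, scale so the largest is 100,
    and round each entry to the nearest integer."""
    values = [v for col in pssm for v in col.values()]
    lo, hi = min(values), max(values)
    scale = 100.0 / (hi - lo)
    return [{a: round((v - lo) * scale) for a, v in col.items()} for col in pssm]
```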

45 Converting scores to p-values
Say that your motif has N columns (integerized as above, with rows A, C, G, T). Create a matrix with N rows and 100N + 1 columns, one per achievable score 0 through 100N. The entry in row i, column j is the number of different sequences of length i that can have a score of j.

46 Converting scores to p-values
For each value in the first column of your motif, add a 1 to the corresponding entry in the first row of the matrix. There are only 4 possible sequences of length 1.

47 Converting scores to p-values
For each value x in the second column of your motif, consider each count y in the zth column of the first row of the matrix. Add y to the (x+z)th column of the second row.

48 Converting scores to p-values
For each value x in the second column of your motif, consider each count y in the zth column of the first row of the matrix. Add y to the (x+z)th column of the second row. What values will go in row 2? 10+67, 10+39, 10+71, 10+43, 60+67, …. These 16 sums correspond to all 16 strings of length 2.

49 Converting scores to p-values
In the end, the bottom row counts all possible sequences of length N at each score. Use these counts to compute a p-value: the fraction of sequences scoring at least the observed score.
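The whole table construction from the last five slides fits in a short dynamic program. A sketch, assuming the integerized matrix from earlier and a uniform background (for a non-uniform background, weight each count by the letter's frequency):

```python
def score_counts(int_pssm):
    """Row i, column j counts the length-(i+1) sequences scoring exactly j;
    the bottom row therefore covers all 4**N sequences of length N."""
    n = len(int_pssm)
    table = [[0] * (100 * n + 1) for _ in range(n)]
    for x in int_pssm[0].values():          # the 4 sequences of length 1
        table[0][x] += 1
    for i in range(1, n):
        for z, y in enumerate(table[i - 1]):
            if y:                           # y sequences reach score z ...
                for x in int_pssm[i].values():
                    table[i][z + x] += y    # ... then gain x from the next letter
    return table[n - 1]

def pvalue(counts, score):
    """Fraction of all length-N sequences scoring at least `score`."""
    return sum(counts[score:]) / sum(counts)
```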

50 Multiple testing correction
Noble. “How does multiple testing correction work?” Nature Biotechnology 2010.


52 Multiple testing Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assume that all of the observations are explainable by the null hypothesis. What is the chance that at least one of the observations will receive a p-value less than 0.05?

53 Multiple testing
Assuming that all of the observations are explainable by the null hypothesis, what is the chance that at least one of them will receive a p-value less than 0.05? Pr(making a mistake on one test) = 0.05. Pr(not making a mistake) = 0.95. Pr(not making any mistake in 20 tests) = 0.95^20 = 0.358. Pr(making at least one mistake) = 1 - 0.358 = 0.642. There is a 64.2% chance of making at least one mistake.

54 Bonferroni correction
How does it work?

55 Bonferroni correction
Divide the desired p-value threshold by the number of tests performed. For the previous example, 0.05 / 20 = 0.0025. Pr(making a mistake on one test) = 0.0025. Pr(not making a mistake) = 0.9975. Pr(not making any mistake in 20 tests) = 0.9975^20 = 0.951. Pr(making at least one mistake) = 1 - 0.951 = 0.049. The correction does not assume that the individual tests are independent (though this illustration does).
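Repeating the slide's arithmetic (which, for illustration only, treats the 20 tests as independent):

```python
per_test = 0.05 / 20                 # Bonferroni-adjusted threshold
fwer = 1 - (1 - per_test) ** 20      # chance of at least one mistake
print(per_test, fwer)                # 0.0025, ~0.049: below the desired 0.05
```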

56 Sample problem
You have scanned both strands of the human genome with a single PSSM, yielding 6 × 10^9 scores. You use dynamic programming to assign a p-value of 2.1 × 10^-11 to the top-scoring match. Is this match significant at a 95% confidence threshold? No, because the Bonferroni-adjusted threshold is 0.05 / (6 × 10^9) = 8.3 × 10^-12, which is smaller than the p-value.

57 Proof: Bonferroni adjustment controls the family-wise error rate
Let m = the number of hypotheses, m0 = the number of true null hypotheses, α = the desired control, and pi = the ith p-value. Then:
FWER = Pr(some null pi ≤ α/m)
≤ Σ over the m0 null hypotheses of Pr(pi ≤ α/m)   (Boole's inequality)
= m0 · α/m   (definition of p-value)
≤ α   (definition of m and m0, since m0 ≤ m)
Note: the Bonferroni adjustment does not require that the tests be independent.

58 Types of errors False positive: the algorithm indicates that this position is a binding site, but it actually is not. False negative: the site is a binding site, but the algorithm indicates that it is not. Both types of errors are defined relative to some confidence threshold. Typically, researchers are more concerned about false positives.

59 False discovery proportion
The false discovery proportion (FDP) is the percentage of target sequences above the threshold that are false positives; in the context of motif scanning, it is the percentage of sites above the threshold that are not binding sites. The FDR is the expected value of the FDP. Example (binding vs. non-binding sites): 13 TP, 5 FP, 33 TN, 5 FN. FDP = FP / (FP + TP) = 5/18 = 27.8%.

60 Family-wise error rate vs. false discovery rate
Bonferroni controls the family-wise error rate; i.e., the probability of at least one false positive among the sequences that score better than the threshold. With FDR control, you aim to control the percentage of false positives among the sequences that score better than the threshold.

61 Controlling the FDR
Order the unadjusted p-values p1 ≤ p2 ≤ … ≤ pm. To control FDR at level α, find the largest j* such that pj* ≤ (j*α)/m, then reject the null hypothesis for j = 1, …, j*. (Benjamini & Hochberg, 1995)
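A sketch of the step-up rule (my implementation; the example p-values are invented):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Sort the p-values, find the largest rank j with p_(j) <= (j*alpha)/m,
    and reject the hypotheses with the j* smallest p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    j_star = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            j_star = rank
    return sorted(order[:j_star])  # indices of rejected hypotheses

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.5]))
# -> [0, 1]: only the two smallest p-values fall under their thresholds
```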

62 FDR example
(Table columns: rank j, threshold (jα)/m, p-value.) Choose the largest rank j such that the corresponding p-value is less than (jα)/m. Approximately 5% of the examples above the line are expected to be false positives.

63 Summary – Multiple testing correction
Selecting a significance threshold requires evaluating the cost of making a mistake. Bonferroni correction divides the desired p-value threshold by the number of statistical tests performed. The false discovery proportion is the percentage of false positives among the target sequences that score better than the threshold. Use Bonferroni correction when you want to avoid making a single mistake; control the false discovery rate when you can tolerate a certain percentage of mistakes.

