
1 CpG islands in DNA sequences
A modeling example
[Title-slide figure: the eight model states A+, C+, G+, T+ and A-, C-, G-, T-.]

2 Methylation & Silencing
One way cells differentiate is methylation: the addition of a CH3 group to C nucleotides, which silences genes in the region.
CG (denoted CpG) often mutates to TG when methylated.
In each cell, one copy of the X chromosome is silenced; methylation plays a role.
Methylation is inherited during cell division.

Speaker notes (from Wikipedia): DNA methylation is a type of chemical modification of DNA that can be inherited without changing the DNA sequence. As such, it is part of the epigenetic code. DNA methylation involves the addition of a methyl group to DNA, for example to the number 5 carbon of the cytosine pyrimidine ring. DNA methylation is probably universal in eukaryotes. In humans, approximately 1% of DNA bases undergo DNA methylation. In adult somatic tissues, DNA methylation typically occurs in a CpG dinucleotide context; non-CpG methylation is prevalent in embryonic stem cells. In plants, cytosines are methylated both symmetrically (CpG or CpNpG) and asymmetrically (CpNpNp), where N can be any nucleotide.

DNA methylation in mammals: Between 60-70% of all CpGs are methylated. Unmethylated CpGs are grouped in clusters called "CpG islands" that are present in the 5' regulatory regions of many genes. In many disease processes such as cancer, gene promoter CpG islands acquire abnormal hypermethylation, which results in heritable transcriptional silencing. Reinforcement of the transcriptionally silent state is mediated by proteins that can bind methylated CpGs. These proteins, called methyl-CpG binding proteins, recruit histone deacetylases and other chromatin remodelling proteins that can modify histones, thereby forming compact, inactive chromatin termed heterochromatin. This link between DNA methylation and chromatin structure is very important. In particular, loss of Methyl-CpG-binding Protein 2 (MeCP2) has been implicated in Rett syndrome, and Methyl-CpG binding domain protein 2 (MBD2) mediates the transcriptional silencing of hypermethylated genes in cancer.

DNA methylation in humans: In humans, the process of DNA methylation is carried out by three enzymes, DNA methyltransferase 1, 3a, and 3b (DNMT1, DNMT3a, DNMT3b). It is thought that DNMT3a and DNMT3b are the de novo methyltransferases that set up DNA methylation patterns early in development. DNMT1 is the proposed maintenance methyltransferase that is responsible for copying DNA methylation patterns to the daughter strands during DNA replication. DNMT3L is a protein that is homologous to the other DNMTs but has no catalytic activity. Instead, DNMT3L assists the de novo methyltransferases by increasing their ability to bind to DNA and stimulating their activity. Since many tumor suppressor genes are silenced by DNA methylation during carcinogenesis, there have been attempts to re-express these genes by inhibiting the DNMTs. 5-aza-2'-deoxycytidine (decitabine) is a nucleoside analog that inhibits DNMTs by trapping them in a covalent complex on DNA, preventing the β-elimination step of catalysis and thus resulting in the enzymes' degradation. However, for decitabine to be active it must be incorporated into the genome of the cell, which can cause mutations in the daughter cells if the cell does not die. Additionally, decitabine is toxic to the bone marrow, which limits the size of its therapeutic window. These pitfalls have led to the development of antisense RNA therapies that target the DNMTs by degrading their mRNAs and preventing their translation. However, it is currently unclear whether targeting DNMT1 alone is sufficient to reactivate tumor suppressor genes silenced by DNA methylation.

DNA methylation in plants: Significant progress has been made in understanding DNA methylation in plants, specifically in the model plant Arabidopsis thaliana. The principal DNA methyltransferases in A. thaliana (Met1, Cmt3, and Drm2) are similar at the sequence level to the mammalian methyltransferases. Drm2 is thought to participate in de novo DNA methylation as well as in the maintenance of DNA methylation, while Cmt3 and Met1 act principally in the maintenance of DNA methylation [1]. Other DNA methyltransferases are expressed in plants but have no known function (see [2]). The specificity of DNA methyltransferases is thought to be driven by RNA-directed DNA methylation: specific RNA transcripts are produced from a genomic DNA template and may form double-stranded RNA molecules. The double-stranded RNAs, through either the small interfering RNA (siRNA) or micro RNA (miRNA) pathways, direct the localization of DNA methyltransferases to specific targets in the genome [3].

3 Example: CpG Islands
CpG dinucleotides in the genome are frequently methylated. (Write CpG, not to be confused with the C·G base pair.)
C → methyl-C → T
Methylation is often suppressed around genes and promoters → CpG islands.

4 Example: CpG Islands
In CpG islands, CG is more frequent; other pairs (AA, AG, AT, …) have different frequencies.
Question: Detect CpG islands computationally.

5 A model of CpG Islands – (1) Architecture
[Figure: two groups of states, the "CpG Island" (+) states A+, C+, G+, T+ and the "Not CpG Island" (-) states A-, C-, G-, T-, with transitions within and between the two groups.]

6 A model of CpG Islands – (2) Transitions
How do we estimate the parameters of the model?
Emission probabilities: 1/0 (state A+ emits A with probability 1, and so on).
Transition probabilities within CpG islands: established from known CpG islands (training set).
Transition probabilities within other regions: established from known non-CpG islands.

Transitions inside CpG islands (+ model); each row sums to 1:

+     A      C      G      T
A   .180   .274   .426   .120
C   .171   .368   .274   .188
G   .161   .339   .375   .125
T   .079   .355   .384   .182

Transitions outside CpG islands (- model); each row sums to 1:

-     A      C      G      T
A   .300   .205   .285   .210
C   .322   .298   .078   .302
G   .248   .246   .298   .208
T   .177   .239   .292   .292

Note: the transitions out of each state add up to one, so there is no room left for transitions between (+) and (-) states.
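A minimal sketch of how such tables could be estimated: count dinucleotide transitions in annotated training sequences and normalize each row. The function name and the toy training strings below are illustrative, not the actual training set behind the numbers above.

```python
from collections import defaultdict

def estimate_transitions(sequences):
    """Estimate a 4x4 transition matrix a[u][v] = P(next = v | current = u)
    from a list of DNA strings (e.g., known CpG islands for the + model,
    known non-island regions for the - model)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        seq = seq.upper()
        for u, v in zip(seq, seq[1:]):          # every adjacent pair (dinucleotide)
            counts[u][v] += 1
    matrix = {}
    for u in "ACGT":
        total = sum(counts[u][v] for v in "ACGT")
        # normalize each row so that transitions out of u sum to 1
        matrix[u] = {v: (counts[u][v] / total if total else 0.25) for v in "ACGT"}
    return matrix

# Toy usage (real training sets would be experimentally annotated regions):
plus_model = estimate_transitions(["CGCGGCGCTCGCGC", "GCGCGCAGCGT"])
minus_model = estimate_transitions(["ATGTAATTTAAAT", "TTATAGCAAATAT"])
print(plus_model["C"]["G"], minus_model["C"]["G"])
```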

7 Log Likelihoods: Telling “CpG Island” from “Non-CpG Island”
Another way to see the effect of the transitions: log-likelihood ratios
L(u, v) = log[ P(uv | +) / P(uv | -) ]
Given a region x = x1…xN, a quick-and-dirty way to decide whether the entire x is a CpG island:
P(x is CpG) > P(x is not CpG)  ⇔  Σi L(xi, xi+1) > 0

L(u,v)    A       C       G       T
A      -0.740  +0.419  +0.580  -0.803
C      -0.913  +0.302  +1.812  -0.685
G      -0.624  +0.461  +0.331  -0.730
T      -1.169  +0.573  +0.393  -0.679
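A small sketch of this quick-and-dirty test, summing the tabulated L(u, v) values over consecutive pairs of a region; the two example strings are made up to show one positive and one negative score.

```python
# Log-likelihood ratios L(u, v) from the table above
LLR = {
    "A": {"A": -0.740, "C": +0.419, "G": +0.580, "T": -0.803},
    "C": {"A": -0.913, "C": +0.302, "G": +1.812, "T": -0.685},
    "G": {"A": -0.624, "C": +0.461, "G": +0.331, "T": -0.730},
    "T": {"A": -1.169, "C": +0.573, "G": +0.393, "T": -0.679},
}

def cpg_score(x):
    """Sum of L(xi, xi+1) over the region; > 0 suggests the region is a CpG island."""
    return sum(LLR[u][v] for u, v in zip(x, x[1:]))

print(cpg_score("GCGCGCGCAT"))   # positive: CG-rich window
print(cpg_score("ATTATAATTA"))   # negative: AT-rich window
```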

8 A model of CpG Islands – (2) Transitions
What about transitions between (+) and (-) states? They affect:
the average length of a CpG island
the average separation between two CpG islands
[Figure: two super-states X and Y; X stays in X with probability p and moves to Y with probability 1-p; Y stays in Y with probability q and moves to X with probability 1-q.]
Length distribution of a region of type X:
P[lX = 1] = 1-p
P[lX = 2] = p(1-p)
…
P[lX = k] = p^(k-1) (1-p)
E[lX] = 1/(1-p)
This is a geometric distribution, with mean 1/(1-p).
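A quick numeric sanity check of the geometric length formula, assuming a hypothetical self-transition probability p; it compares the truncated sum of k·p^(k-1)·(1-p) with the closed form 1/(1-p).

```python
def expected_length(p, max_k=100000):
    """Mean of the geometric length distribution P[l = k] = p**(k-1) * (1 - p),
    summed numerically; should match the closed form 1 / (1 - p)."""
    return sum(k * p**(k - 1) * (1 - p) for k in range(1, max_k + 1))

p = 0.999                     # hypothetical self-transition probability for the + super-state
print(expected_length(p))     # ~1000, numerically
print(1 / (1 - p))            # 1000, closed form: average island of ~1000 positions
```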

9 A model of CpG Islands – (2) Transitions
Right now, aA+A+ + aA+C+ + aA+G+ + aA+T+ = 1, and we need to adjust the aij to allow transitions between (+) and (-) states.
Say we want probability p++ of staying within CpG and p-- of staying within non-CpG.
Scale all the within-(+) probabilities by that factor: for example, aA+G+ ← p++ × aA+G+.
Now calculate the probabilities between (+) and (-) states:
The total probability aA+S, where S ranges over the (-) states, is (1 – p++).
Let qA-, qC-, qG-, qT- be the proportions of A, C, G, and T within the non-CpG states in the training set.
Then let aA+A- = (1 – p++) × qA-; aA+C- = (1 – p++) × qC-; and so on.
Do the same for the (-) to (+) transitions.
OK, but how do we estimate p++ and p--?
Estimate the average length of a CpG island: l+ = 1/(1 – p++), so p++ = 1 – 1/l+.
Do the same for the average length between two CpG islands, l-, to get p--.
What we just did is a back-of-the-envelope learning procedure: we adjusted the parameters in a manner similar to the learning algorithms covered in this lecture.
[Figure: the (+) states A+, C+, G+, T+ connect to the (-) states A-, C-, G-, T- with total probability 1 – p++.]
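A sketch of how the full 8-state transition matrix could be assembled from the two 4x4 matrices above, a chosen p++ and p--, and nucleotide proportions q. The length estimates and the uniform q used below are assumptions for illustration.

```python
def build_full_transitions(a_plus, a_minus, p_plus, p_minus, q_plus, q_minus):
    """Combine the two 4x4 matrices into one 8x8 transition matrix over states
    like 'A+', ..., 'T-'.  Within-group probabilities are scaled by the
    stay-probability; cross-group probabilities are spread according to the
    nucleotide proportions q of the destination group."""
    full = {}
    for u in "ACGT":
        # transitions out of a (+) state
        row = {f"{v}+": p_plus * a_plus[u][v] for v in "ACGT"}
        row.update({f"{v}-": (1 - p_plus) * q_minus[v] for v in "ACGT"})
        full[f"{u}+"] = row
        # transitions out of a (-) state
        row = {f"{v}-": p_minus * a_minus[u][v] for v in "ACGT"}
        row.update({f"{v}+": (1 - p_minus) * q_plus[v] for v in "ACGT"})
        full[f"{u}-"] = row
    return full

# Within-group matrices copied from the slide-6 tables:
a_plus = {"A": {"A": .180, "C": .274, "G": .426, "T": .120},
          "C": {"A": .171, "C": .368, "G": .274, "T": .188},
          "G": {"A": .161, "C": .339, "G": .375, "T": .125},
          "T": {"A": .079, "C": .355, "G": .384, "T": .182}}
a_minus = {"A": {"A": .300, "C": .205, "G": .285, "T": .210},
           "C": {"A": .322, "C": .298, "G": .078, "T": .302},
           "G": {"A": .248, "C": .246, "G": .298, "T": .208},
           "T": {"A": .177, "C": .239, "G": .292, "T": .292}}
# Illustrative assumptions: average island length 1000, average gap 100,000,
# uniform nucleotide proportions q.
p_plus, p_minus = 1 - 1 / 1000, 1 - 1 / 100000
q = {b: 0.25 for b in "ACGT"}
full = build_full_transitions(a_plus, a_minus, p_plus, p_minus, q, q)
print(sum(full["A+"].values()))   # each row still sums to 1
```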

10 Applications of the model
Given a DNA region x, the Viterbi algorithm predicts locations of CpG islands.
Given a nucleotide xi (say xi = A), the Viterbi parse tells whether xi is in a CpG island in the most likely overall scenario.
The Forward/Backward algorithms can calculate P(xi is in a CpG island) = P(πi = A+ | x).
Posterior Decoding can assign locally optimal predictions of CpG islands: π̂i = argmaxk P(πi = k | x).
Advantage: each nucleotide is more likely to be called correctly.
Disadvantage: the overall parse will be "choppy", with CpG islands that are too short.

11 What if a new genome comes?
We just sequenced the porcupine genome.
We know CpG islands play the same role in this genome.
However, we have no known CpG islands for porcupines.
We suspect the frequency and characteristics of CpG islands are quite different in porcupines.
How do we adjust the parameters in our model? LEARNING

12 Problem 3: Learning
Re-estimate the parameters of the model based on training data

13 Two learning scenarios
1. Estimation when the “right answer” is known
Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
2. Estimation when the “right answer” is unknown
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
QUESTION: Update the parameters θ of the model to maximize P(x|θ)

14 1. When the right answer is known
Given x = x1…xN for which the true path π = π1…πN is known, define:
Akl = # times the k→l transition occurs in π
Ek(b) = # times state k in π emits b in x
We can show that the maximum likelihood parameters θ (the ones that maximize P(x|θ)) are:
akl = Akl / Σi Aki
ek(b) = Ek(b) / Σc Ek(c)
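A minimal counting implementation of these maximum-likelihood formulas, assuming the state path is given as a sequence of single-character state labels aligned with x; the casino data reuses the example on the next slide.

```python
from collections import Counter

def ml_estimate(x, pi, states, alphabet):
    """Maximum-likelihood HMM parameters from a sequence x and its known
    state path pi: normalized transition and emission counts."""
    A = Counter(zip(pi, pi[1:]))          # A[(k, l)] = # times k -> l occurs in pi
    E = Counter(zip(pi, x))               # E[(k, b)] = # times state k emits b
    # max(1, ...) only guards against division by zero for unused states
    a = {k: {l: A[(k, l)] / max(1, sum(A[(k, m)] for m in states)) for l in states}
         for k in states}
    e = {k: {b: E[(k, b)] / max(1, sum(E[(k, c)] for c in alphabet)) for b in alphabet}
         for k in states}
    return a, e

# Casino example from the next slide: ten rolls, all labeled Fair.
x  = list("2156123623")
pi = list("FFFFFFFFFF")
a, e = ml_estimate(x, pi, states="FL", alphabet="123456")
print(a["F"]["F"], e["F"]["2"])   # 1.0 and 0.3; note e["F"]["4"] == 0 (overfitting)
```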

15 1. When the right answer is known
Intuition: when we know the underlying states, the best estimate is the normalized frequency of the transitions and emissions that occur in the training data.
Drawback: given little data, there may be overfitting: P(x|θ) is maximized, but θ is unreasonable (zero probabilities are bad).
Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then: aFF = 1; aFL = 0
eF(1) = eF(3) = .2; eF(2) = .3; eF(4) = 0; eF(5) = eF(6) = .1

16 Pseudocounts
Solution for small training sets: add pseudocounts.
Akl = # times the k→l transition occurs in π, plus rkl
Ek(b) = # times state k in π emits b in x, plus rk(b)
rkl and rk(b) are pseudocounts representing our prior belief.
Larger pseudocounts → strong prior belief.
Small pseudocounts (< 1): just enough to avoid 0 probabilities.

17 Pseudocounts: Example, dishonest casino
We will observe the player for one day, 600 rolls.
Reasonable pseudocounts:
r0F = r0L = rF0 = rL0 = 1
rFL = rLF = rFF = rLL = 1
rF(1) = rF(2) = … = rF(6) = 20 (strong belief that fair is fair)
rL(1) = rL(2) = … = rL(6) = 5 (wait and see for loaded)
The numbers above are arbitrary; assigning priors is an art.
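A small sketch of applying emission pseudocounts before normalizing; the observed counts below are hypothetical, and only the pseudocount values come from the slide.

```python
def smoothed_emissions(counts, pseudocounts):
    """Add pseudocounts to observed emission counts, then normalize."""
    total = sum(counts[b] + pseudocounts[b] for b in counts)
    return {b: (counts[b] + pseudocounts[b]) / total for b in counts}

# Hypothetical observed counts for the Fair state after one evening of rolls:
observed_F = {"1": 95, "2": 110, "3": 88, "4": 102, "5": 97, "6": 108}
r_F = {b: 20 for b in observed_F}       # rF(1) = ... = rF(6) = 20 from the slide
print(smoothed_emissions(observed_F, r_F))   # all six probabilities pulled toward 1/6
```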

18 2. When the right answer is unknown
We don't know the true Akl, Ek(b).
Idea:
We estimate our "best guess" of what Akl and Ek(b) are (or we start with random / uniform values).
We update the parameters of the model based on our guess.
We repeat.

19 2. When the right answer is unknown
Starting with our best guess of a model M with parameters θ:
Given x = x1…xN for which the true path π = π1…πN is unknown,
we can get to a provably more likely parameter set θ, i.e., a θ that increases the probability P(x | θ).
Principle: EXPECTATION MAXIMIZATION
1. Estimate Akl, Ek(b) in the training data
2. Update θ according to Akl, Ek(b)
3. Repeat 1 & 2 until convergence

20 Estimating new parameters
To estimate Akl (assume "| θCURRENT" in all formulas below):
At each position i of sequence x, find the probability that transition k→l is used:
P(πi = k, πi+1 = l | x) = [1/P(x)] P(πi = k, πi+1 = l, x1…xN) = Q/P(x)
where Q = P(x1…xi, πi = k, πi+1 = l, xi+1…xN)
= P(πi+1 = l, xi+1…xN | πi = k) P(x1…xi, πi = k)
= P(πi+1 = l, xi+1 xi+2…xN | πi = k) fk(i)
= P(xi+2…xN | πi+1 = l) P(xi+1 | πi+1 = l) P(πi+1 = l | πi = k) fk(i)
= bl(i+1) el(xi+1) akl fk(i)
So: P(πi = k, πi+1 = l | x, θ) = fk(i) akl el(xi+1) bl(i+1) / P(x | θCURRENT)

21 Estimating new parameters
So, Akl is the expected number of times the k→l transition is used, given the current θ:
Akl = Σi P(πi = k, πi+1 = l | x, θ) = Σi fk(i) akl el(xi+1) bl(i+1) / P(x | θ)
Similarly,
Ek(b) = [1/P(x | θ)] Σ{i | xi = b} fk(i) bk(i)
[Figure: the path passes from state k at position i to state l at position i+1; fk(i) covers x1…xi, akl and el(xi+1) cover the transition and emission at xi+1, and bl(i+1) covers xi+2…xN.]
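A sketch of the E-step these formulas describe: plain (unscaled) forward and backward passes followed by the expected transition and emission counts. It assumes short toy inputs and an explicit initial distribution start[k], which the slides leave implicit; real implementations would work in log space or with scaling.

```python
import itertools

def forward_backward(x, states, a, e, start):
    """Unscaled forward and backward tables for a short sequence x.
    a[k][l] = transition prob, e[k][b] = emission prob, start[k] = initial prob."""
    N = len(x)
    f = [{k: 0.0 for k in states} for _ in range(N)]
    b = [{k: 0.0 for k in states} for _ in range(N)]
    for k in states:                                   # f_k(0) = start_k * e_k(x_0)
        f[0][k] = start[k] * e[k][x[0]]
    for i in range(1, N):                              # f_l(i) = e_l(x_i) * sum_k f_k(i-1) a_kl
        for l in states:
            f[i][l] = e[l][x[i]] * sum(f[i-1][k] * a[k][l] for k in states)
    for k in states:
        b[N-1][k] = 1.0
    for i in range(N-2, -1, -1):                       # b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1)
        for k in states:
            b[i][k] = sum(a[k][l] * e[l][x[i+1]] * b[i+1][l] for l in states)
    px = sum(f[N-1][k] for k in states)                # P(x | theta)
    return f, b, px

def expected_counts(x, states, alphabet, a, e, start):
    """E-step: expected transition counts A[k][l] and emission counts E[k][b]."""
    f, b, px = forward_backward(x, states, a, e, start)
    N = len(x)
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {c: 0.0 for c in alphabet} for k in states}
    for i in range(N - 1):
        for k, l in itertools.product(states, states):
            # P(pi_i = k, pi_{i+1} = l | x) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
            A[k][l] += f[i][k] * a[k][l] * e[l][x[i+1]] * b[i+1][l] / px
    for i in range(N):
        for k in states:
            # P(pi_i = k | x) = f_k(i) b_k(i) / P(x), accumulated per emitted symbol
            E[k][x[i]] += f[i][k] * b[i][k] / px
    return A, E, px
```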

22 The Baum-Welch Algorithm
Initialization: pick a best guess for the model parameters (or arbitrary values).
Iteration:
1. Forward
2. Backward
3. Calculate Akl, Ek(b), given θCURRENT
4. Calculate the new model parameters θNEW: akl, ek(b)
5. Calculate the new log-likelihood P(x | θNEW)
   (guaranteed to be higher, by expectation maximization)
Until P(x | θ) does not change much.
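A compact outer loop for the iteration above, reusing the expected_counts sketch from the previous slide; the tolerance, the iteration cap, and the absence of pseudocounts are simplifications, not part of the algorithm as stated.

```python
import math

def baum_welch(x, states, alphabet, a, e, start, tol=1e-6, max_iters=200):
    """Iterate E-step (expected counts) and M-step (re-normalize) until
    log P(x | theta) stops improving.  Uses expected_counts() from the
    previous sketch; no pseudocounts or zero-count guards, for brevity."""
    old_ll = -math.inf
    for _ in range(max_iters):
        A, E, px = expected_counts(x, states, alphabet, a, e, start)
        # M-step: new parameters are the normalized expected counts
        a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
        e = {k: {c: E[k][c] / sum(E[k].values()) for c in alphabet} for k in states}
        ll = math.log(px)
        if ll - old_ll < tol:                 # likelihood no longer improving much
            break
        old_ll = ll
    return a, e, ll

# Usage sketch (a0, e0 would be initial guesses for the parameters):
# a, e, ll = baum_welch("316664153666", "FL", "123456", a0, e0, {"F": 0.5, "L": 0.5})
```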

23 The Baum-Welch Algorithm
Time complexity: (# iterations) × O(K²N)
Guaranteed to increase the log likelihood P(x | θ)
Not guaranteed to find the globally best parameters: converges to a local optimum, depending on initial conditions
Too many parameters / too large a model: overtraining

24 Alternative: Viterbi Training
Initialization: same as Baum-Welch
Iteration:
1. Perform Viterbi to find π*
2. Calculate Akl, Ek(b) according to π*, plus pseudocounts
3. Calculate the new parameters akl, ek(b)
Until convergence
Notes:
Not guaranteed to increase P(x | θ)
Guaranteed to increase P(x | θ, π*)
In general, worse performance than Baum-Welch
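A sketch of this loop under the same toy-parameter layout, assuming a viterbi_decode() function like the one sketched with slide 26 below; a fixed iteration count stands in for a real convergence test.

```python
def viterbi_training(x, states, alphabet, a, e, start, pseudo=1.0, max_iters=50):
    """Repeat: decode with Viterbi, then re-estimate parameters from the
    decoded path (counts + pseudocounts).  viterbi_decode() is assumed to be
    a standard Viterbi decoder, e.g. the sketch given with slide 26 below."""
    for _ in range(max_iters):
        pi = viterbi_decode(x, states, a, e, start)          # hypothetical decoder
        A = {k: {l: pseudo for l in states} for k in states}
        E = {k: {c: pseudo for c in alphabet} for k in states}
        for k, l in zip(pi, pi[1:]):
            A[k][l] += 1
        for k, b in zip(pi, x):
            E[k][b] += 1
        a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
        e = {k: {c: E[k][c] / sum(E[k].values()) for c in alphabet} for k in states}
    return a, e
```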

25 Conditional Random Fields
A brief description

26 Let’s look at an HMM again
[Trellis figure: states 1…K at each position, emitting x1, x2, x3, …, xN.]
HMMs and Dynamic Programming
The "best" state sequence for 1…i interacts with the "best" sequence for i+1…N through only K² arrows.
Vl(i+1) = el(xi+1) maxk Vk(i) akl = maxk( Vk(i) + [ e(l, i+1) + a(k, l) ] )   (where e(.,.) and a(.,.) are logs)
The total likelihood of all state sequences for 1…i+1 can be calculated from the total likelihood for 1…i by summing over only K² arrows.
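A minimal log-space Viterbi decoder following this recurrence (and providing the viterbi_decode() assumed in the Viterbi-training sketch above); the dictionary-based parameter layout and the explicit start distribution are assumptions.

```python
import math

def viterbi_decode(x, states, a, e, start):
    """Most likely state path, computed in log space:
    V_l(i+1) = max_k ( V_k(i) + log a_kl ) + log e_l(x_{i+1})."""
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    V = [{k: log(start[k]) + log(e[k][x[0]]) for k in states}]
    ptr = []
    for i in range(1, len(x)):
        V.append({}); ptr.append({})
        for l in states:
            best_k = max(states, key=lambda k: V[i-1][k] + log(a[k][l]))
            ptr[i-1][l] = best_k                 # best predecessor of state l at position i
            V[i][l] = V[i-1][best_k] + log(a[best_k][l]) + log(e[l][x[i]])
    # traceback from the best final state
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(x) - 2, -1, -1):
        path.append(ptr[i][path[-1]])
    return list(reversed(path))
```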

27 Let’s look at an HMM again
[Same trellis figure: states 1…K at each position, emitting x1, x2, x3, …, xN.]
Some shortcomings of HMMs:
They can't model state duration. Solution: explicit duration models (semi-Markov HMMs).
Unfortunately, state πi cannot "look" at any letter other than xi!
Strong independence assumption: P(πi | x1…xi-1, π1…πi-1) = P(πi | πi-1)

28 Let’s look at an HMM again
[Same trellis figure: states 1…K at each position, emitting x1, x2, x3, …, xN.]
Another way to put this: the features used in the objective function P(x, π) are akl and ek(b), where b ∈ Σ.
At position i: K² akl features and |Σ|·K el(xi) features play a role.
Vl(i) = Vk(i – 1) + (a(k, l) + e(l, i)) = Vk(i – 1) + g(k, l, xi)
Let's generalize g:  Vk(i – 1) + g(k, l, x, i)

29 “Features” that depend on many pos. in x
What do we put in g(k, l, x, i)? The higher g(k, l, x, i), the more we favor going from k to l at position i. We can build richer models using this additional power.
Examples:
If the previous 100 positions have > 50 sixes, the casino player favors changing to Fair:
g(Loaded, Fair, x, i) += 1[xi-100, …, xi-1 has > 50 sixes] × wDON'T_GET_CAUGHT
Genes are close to CpG islands; for any state k:
g(k, exon, x, i) += 1[xi-1000, …, xi+1000 has > 1/16 CpG] × wCG_RICH_REGION

30 “Features” that depend on many pos. in x
Conditional Random Fields: Features
Define a set of features that you think are important. All features should be functions of the current state, the previous state, x, and the position i.
Example:
Old features: transition k→l, emission of b from state k
Plus new features: the previous 100 letters have > 50 sixes
Number the features 1…n: f1(k, l, x, i), …, fn(k, l, x, i); the features are indicator (true/false) variables.
Find appropriate weights w1, …, wn for when each feature is true; the weights are the parameters of the model.
Assuming for now that each feature fj has a weight wj:
g(k, l, x, i) = Σj=1…n fj(k, l, x, i) × wj
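A tiny sketch of g as a weighted sum of indicator features, with two features mirroring the casino examples; the feature names, the window handling at the sequence edges, and the weight values are illustrative, not trained.

```python
def count_sixes(x, lo, hi):
    """Number of sixes among rolls x[lo:hi] (clamped to the sequence)."""
    return sum(1 for roll in x[max(0, lo):max(0, hi)] if roll == "6")

# Each feature is an indicator function of (previous state k, current state l, x, i).
def f_transition_LF(k, l, x, i):          # "old" HMM-style transition feature
    return 1 if (k, l) == ("L", "F") else 0

def f_dont_get_caught(k, l, x, i):        # new non-local feature from the slide
    return 1 if (k, l) == ("L", "F") and count_sixes(x, i - 100, i) > 50 else 0

features = [f_transition_LF, f_dont_get_caught]
weights  = [-1.2, 3.0]                    # illustrative weights, not trained values

def g(k, l, x, i):
    """g(k, l, x, i) = sum_j w_j * f_j(k, l, x, i)."""
    return sum(w * f(k, l, x, i) for f, w in zip(features, weights))

print(g("L", "F", "6" * 120, 110))        # both features fire: -1.2 + 3.0 = 1.8
```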

31 “Features” that depend on many pos. in x
Define Vk(i): the optimal score of "parsing" x1…xi and ending in state k.
Then, assuming Vk(i) is optimal for every k at position i, it follows that
Vl(i+1) = maxk [ Vk(i) + g(k, l, x, i+1) ]
Why? Even though at position i+1 we "look" at arbitrary positions in x, the choice of the optimal state l at i+1 is only affected by the choice of the ending state k.
Therefore, the Viterbi algorithm again finds the optimal (highest-scoring) parse for x1…xN.
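The same decoding idea with g(k, l, x, i) in place of the log-transition plus log-emission scores; the g0 term for the first position is an assumption, since the slide's recurrence starts at i+1.

```python
def crf_viterbi(x, states, g, g0):
    """Highest-scoring parse under scores g(k, l, x, i); g0(l, x) scores the
    choice of the first state (an assumption, not spelled out on the slide)."""
    V = [{l: g0(l, x) for l in states}]
    ptr = []
    for i in range(1, len(x)):
        V.append({}); ptr.append({})
        for l in states:
            best_k = max(states, key=lambda k: V[i-1][k] + g(k, l, x, i))
            ptr[i-1][l] = best_k
            V[i][l] = V[i-1][best_k] + g(best_k, l, x, i)
    last = max(states, key=lambda l: V[-1][l])
    path = [last]
    for i in range(len(x) - 2, -1, -1):
        path.append(ptr[i][path[-1]])
    return list(reversed(path))
```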

32 “Features” that depend on many pos. in x
1 x1 2 x2 3 x3 4 x4 5 x5 6 x6 HMM 1 x1 2 x2 3 x3 4 x4 5 x5 6 x6 CRF Score of a parse depends on all of x at each position Can still do Viterbi because state i only “looks” at prev. state i-1 and the constant sequence x

33 How many parameters are there, in general?
Arbitrarily many parameters!
For example, let fj(k, l, x, i) depend on xi-5, xi-4, …, xi+5. Then we would have up to K × |Σ|^11 parameters!
Advantage: a powerful, expressive model.
Example: "if there are more than 50 sixes in the last 100 rolls, but in the surrounding 18 rolls there are at most 3 sixes, this is evidence we are in the Fair state". Interpretation: the casino player is afraid of being caught, so he switches to Fair when he sees too many sixes.
Example: "if there are any CG-rich regions in the vicinity (a window of 2000 positions), then favor predicting lots of genes in this region".
Question: how do we train these parameters?

34 Conditional Training
Hidden Markov Model training:
Given a training sequence x and its "true" parse π, maximize P(x, π).
Disadvantage: P(x, π) = P(π | x) P(x)
P(π | x): the quantity we care about, so as to get a good parse
P(x): a quantity we don't care so much about, because x is always given

35 Conditional Training
P(x, π) = P(π | x) P(x), so P(π | x) = P(x, π) / P(x)
Recall F(j, x, π) = # times feature fj occurs in (x, π)
                  = Σi=1…N fj(πi-1, πi, x, i)   (count fj in x, π)
In HMMs, let's denote by wj the weight of the jth feature: wj = log(akl) or log(ek(b)).
Then,
HMM:  P(x, π) = exp[ Σj=1…n wj × F(j, x, π) ]
CRF:  Score(x, π) = exp[ Σj=1…n wj × F(j, x, π) ]
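A sketch computing F(j, x, π) and the unnormalized score, reusing the features and weights lists from the slide-30 sketch; treating position 1 as the first transition (no feature for the initial state) is an assumption.

```python
import math

def feature_counts(x, pi, features):
    """F(j, x, pi) = number of positions where feature j fires along the parse."""
    return [sum(f(pi[i-1], pi[i], x, i) for i in range(1, len(x))) for f in features]

def crf_score(x, pi, features, weights):
    """Score(x, pi) = exp( sum_j w_j * F(j, x, pi) ); unnormalized."""
    F = feature_counts(x, pi, features)
    return math.exp(sum(w * Fj for w, Fj in zip(weights, F)))

# Using the toy features/weights defined in the slide-30 sketch:
x  = "6" * 120
pi = ["L"] * 110 + ["F"] * 10
print(feature_counts(x, pi, features), crf_score(x, pi, features, weights))
```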

36 Conditional Training
In HMMs, P(π | x) = P(x, π) / P(x)
P(x, π) = exp[ Σj=1…n wj F(j, x, π) ]
P(x) = Σπ exp[ Σj=1…n wj F(j, x, π) ] =: Z
Then, in a CRF we can do the same to normalize Score(x, π) into a probability:
PCRF(π | x) = exp[ Σj=1…n wj F(j, x, π) ] / Z
QUESTION: Why is this a probability?
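A brute-force check that the normalization works, enumerating all paths of a tiny sequence (the efficient way is the sum-of-paths dynamic program mentioned on the next slide); it reuses crf_score and the toy features from the previous sketches.

```python
import itertools

def partition_Z(x, states, features, weights):
    """Z = sum over all K^N state paths of exp(sum_j w_j F(j, x, pi)).
    Brute force, so only usable for tiny toy sequences."""
    return sum(crf_score(x, list(pi), features, weights)
               for pi in itertools.product(states, repeat=len(x)))

def p_crf(pi, x, states, features, weights):
    """P_CRF(pi | x) = Score(x, pi) / Z."""
    return crf_score(x, pi, features, weights) / partition_Z(x, states, features, weights)

x = "66166"                                  # tiny example: 2^5 = 32 paths
total = sum(p_crf(list(pi), x, "FL", features, weights)
            for pi in itertools.product("FL", repeat=len(x)))
print(total)                                 # 1.0 (up to floating point): it is a probability
```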

37 Conditional Training
We need to be given a set of sequences x and "true" parses π.
Calculate Z by a sum-of-paths algorithm similar to the HMM forward algorithm.
We can then easily calculate P(π | x).
Calculate the partial derivative of P(π | x) with respect to each parameter wj (not covered; akin to forward/backward).
Update each parameter with gradient descent.
Continue until convergence to the optimal set of weights:
P(π | x) = exp[ Σj=1…n wj F(j, x, π) ] / Z is convex.

38 Conditional Random Fields—Summary
Ability to incorporate complicated, non-local feature sets, doing away with some of the independence assumptions of HMMs; parsing is still equally efficient.
Conditional training: train the parameters that are best for parsing, not for modeling.
Need labeled examples: sequences x and "true" parses π. (It is possible to train on unlabeled sequences, but it is unreasonable to train too many parameters this way.)
Training is significantly slower: many iterations of forward/backward.

