CpG islands in DNA sequences


1 CpG islands in DNA sequences
A modeling example
[Figure: the eight model states A+, C+, G+, T+, A-, C-, G-, T-]

2 Methylation & Silencing
One way cells differentiate is methylation
Addition of CH3 to C nucleotides
Silences genes in the region
CG (denoted CpG) often mutates to TG when methylated
In each cell, one copy of X is silenced; methylation plays a role
Methylation is inherited during cell division

From Wikipedia: DNA methylation is a type of chemical modification of DNA that can be inherited without changing the DNA sequence. As such, it is part of the epigenetic code. DNA methylation involves the addition of a methyl group to DNA, for example to the number 5 carbon of the cytosine pyrimidine ring. DNA methylation is probably universal in eukaryotes. In humans, approximately 1% of DNA bases undergo DNA methylation. In adult somatic tissues, DNA methylation typically occurs in a CpG dinucleotide context; non-CpG methylation is prevalent in embryonic stem cells. In plants, cytosines are methylated both symmetrically (CpG or CpNpG) and asymmetrically (CpNpNp), where N can be any nucleotide.

DNA methylation in mammals
Between 60% and 70% of all CpGs are methylated. Unmethylated CpGs are grouped in clusters called "CpG islands" that are present in the 5' regulatory regions of many genes. In many disease processes such as cancer, gene promoter CpG islands acquire abnormal hypermethylation, which results in heritable transcriptional silencing. Reinforcement of the transcriptionally silent state is mediated by proteins that can bind methylated CpGs. These proteins, which are called methyl-CpG binding proteins, recruit histone deacetylases and other chromatin remodelling proteins that can modify histones, thereby forming compact, inactive chromatin termed heterochromatin. This link between DNA methylation and chromatin structure is very important. In particular, loss of Methyl-CpG-binding Protein 2 (MeCP2) has been implicated in Rett syndrome, and Methyl-CpG binding domain protein 2 (MBD2) mediates the transcriptional silencing of hypermethylated genes in cancer.

DNA methylation in humans
In humans, the process of DNA methylation is carried out by three enzymes, DNA methyltransferase 1, 3a, and 3b (DNMT1, DNMT3a, DNMT3b). It is thought that DNMT3a and DNMT3b are the de novo methyltransferases that set up DNA methylation patterns early in development. DNMT1 is the proposed maintenance methyltransferase that is responsible for copying DNA methylation patterns to the daughter strands during DNA replication. DNMT3L is a protein that is homologous to the other DNMTs but has no catalytic activity. Instead, DNMT3L assists the de novo methyltransferases by increasing their ability to bind to DNA and stimulating their activity.

Since many tumor suppressor genes are silenced by DNA methylation during carcinogenesis, there have been attempts to re-express these genes by inhibiting the DNMTs. 5-aza-2'-deoxycytidine (decitabine) is a nucleoside analog that inhibits DNMTs by trapping them in a covalent complex on DNA by preventing the β-elimination step of catalysis, thus resulting in the enzymes' degradation. However, for decitabine to be active, it must be incorporated into the genome of the cell, and this can cause mutations in the daughter cells if the cell does not die. Additionally, decitabine is toxic to the bone marrow, a fact which limits the size of its therapeutic window. These pitfalls have led to the development of antisense RNA therapies that target the DNMTs by degrading their mRNAs and preventing their translation. However, it is currently unclear if targeting DNMT1 alone is sufficient to reactivate tumor suppressor genes silenced by DNA methylation.

DNA methylation in plants
Significant progress has been made in understanding DNA methylation in plants, specifically in the model plant Arabidopsis thaliana. The principal DNA methyltransferases in A. thaliana, Met1, Cmt3, and Drm2, are similar at a sequence level to the mammalian methyltransferases. Drm2 is thought to participate in de novo DNA methylation as well as in the maintenance of DNA methylation. Cmt3 and Met1 act principally in the maintenance of DNA methylation [1]. Other DNA methyltransferases are expressed in plants but have no known function (see [2]). The specificity for DNA methyltransferases is thought to be driven by RNA-directed DNA methylation. Specific RNA transcripts are produced from a genomic DNA template. These RNA transcripts may form double-stranded RNA molecules. The double-stranded RNAs, through either the small interfering RNA (siRNA) or micro RNA (miRNA) pathways, direct the localization of DNA methyltransferases to specific targets in the genome [3].

3 Example: CpG Islands
CpG nucleotides in the genome are frequently methylated
(Write CpG so as not to confuse it with the CG base pair)
C → methyl-C → T
Methylation is often suppressed around genes and promoters → CpG islands

4 Example: CpG Islands
In CpG islands,
CG is more frequent
Other pairs (AA, AG, AT…) have different frequencies
Question: Detect CpG islands computationally

5 A model of CpG Islands – (1) Architecture
[Figure: model architecture, with a group of states labelled "Not CpG Island"]

6 A model of CpG Islands – (2) Transitions
How do we estimate the parameters of the model?
Emission probabilities: 1/0
Transition probabilities within CpG islands: established from known CpG islands (training set)
Transition probabilities within other regions: established from known non-CpG islands
Note: the transitions out of each state add up to one, leaving no room for transitions between (+) and (-) states

Rows give the current nucleotide, columns the next nucleotide.

 +      A      C      G      T
 A    .180   .274   .426   .120   = 1
 C    .171   .368   .274   .188   = 1
 G    .161   .339   .375   .125   = 1
 T    .079   .355   .384   .182   = 1

 -      A      C      G      T
 A    .300   .205   .285   .210   = 1
 C    .322   .298   .078   .302   = 1
 G    .248   .246   .298   .208   = 1
 T    .177   .239   .292   .292   = 1

7 Log Likelihoods: Telling “CpG Island” from “Non-CpG Island”
Another way to see the effects of transitions: log likelihoods
$L(u, v) = \log \left[ P(uv \mid +) \, / \, P(uv \mid -) \right]$
Given a region $x = x_1 \ldots x_N$, a quick-and-dirty way to decide whether the entire $x$ is CpG:
$P(x \text{ is CpG}) > P(x \text{ is not CpG}) \iff \sum_i L(x_i, x_{i+1}) > 0$

Rows give $u$, columns give $v$.

          A        C        G        T
 A    -0.740   +0.419   +0.580   -0.803
 C    -0.913   +0.302   +1.812   -0.685
 G    -0.624   +0.461   +0.331   -0.730
 T    -1.169   +0.573   +0.393   -0.679
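The quick-and-dirty test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transition tables are typed in with every row summing to one (C→G = .274 in the (+) table, C→A = .322, G→G = .298 and T→T = .292 in the (-) table), and the logarithm is taken base 2, which is the choice that reproduces the tabulated values such as +1.812 for CG. The names `log_odds` and `score` are made up for this sketch.

```python
import math

BASES = "ACGT"

# Transition probabilities (row: current base, columns: next base A, C, G, T).
PLUS = {
    "A": [0.180, 0.274, 0.426, 0.120],
    "C": [0.171, 0.368, 0.274, 0.188],
    "G": [0.161, 0.339, 0.375, 0.125],
    "T": [0.079, 0.355, 0.384, 0.182],
}
MINUS = {
    "A": [0.300, 0.205, 0.285, 0.210],
    "C": [0.322, 0.298, 0.078, 0.302],
    "G": [0.248, 0.246, 0.298, 0.208],
    "T": [0.177, 0.239, 0.292, 0.292],
}


def log_odds(u, v):
    """L(u, v): log2 ratio of the u->v transition inside vs. outside islands."""
    j = BASES.index(v)
    return math.log2(PLUS[u][j] / MINUS[u][j])


def score(x):
    """Sum of L(x_i, x_{i+1}) over the region; positive suggests a CpG island."""
    return sum(log_odds(u, v) for u, v in zip(x, x[1:]))
```

For example, `score("CGCGCG")` is positive and `score("ATATAT")` is negative, so the first region would be called CpG and the second would not.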

8 A model of CpG Islands – (2) Transitions
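What about transitions between (+) and (-) states?
They affect:
  the average length of a CpG island
  the average separation between two CpG islands
Length distribution of region X:
$P[l_X = 1] = 1 - p$
$P[l_X = 2] = p(1 - p)$
$P[l_X = k] = p^{k-1}(1 - p)$
$E[l_X] = 1/(1 - p)$
Geometric distribution, with mean $1/(1 - p)$
[Figure: two regions X and Y; X stays in X with probability p and leaves with probability 1-p; Y stays in Y with probability q and leaves with probability 1-q]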
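The mean of the geometric length distribution can be checked numerically. The sketch below is illustrative only: `expected_length` sums $k \, p^{k-1}(1-p)$ directly (truncated at a large `kmax`, an arbitrary cutoff chosen here), and `sample_length` simulates staying in region X with probability `p` at each step. Both function names are invented for this example.

```python
import random


def expected_length(p, kmax=10_000):
    """Truncated sum of k * P[l_X = k] with P[l_X = k] = p**(k-1) * (1 - p)."""
    return sum(k * p ** (k - 1) * (1 - p) for k in range(1, kmax + 1))


def sample_length(p, rng):
    """Simulate one visit to region X: keep staying with probability p."""
    length = 1
    while rng.random() < p:
        length += 1
    return length


if __name__ == "__main__":
    rng = random.Random(0)
    p = 0.9
    samples = [sample_length(p, rng) for _ in range(20_000)]
    print(expected_length(p), 1 / (1 - p), sum(samples) / len(samples))
```

With p = 0.9 the analytic mean is 1/(1 - p) = 10, so a self-transition probability close to one is what produces long islands.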

9 Applications of the model
Given a DNA region $x$,
The Viterbi algorithm predicts locations of CpG islands
Given a nucleotide $x_i$ (say $x_i$ = A),
The Viterbi parse tells whether $x_i$ is in a CpG island in the most likely general scenario
The Forward/Backward algorithms can calculate $P(x_i \text{ is in a CpG island}) = P(\pi_i = A^+ \mid x)$
Posterior Decoding can assign locally optimal predictions of CpG islands:
$\hat{\pi}_i = \arg\max_k P(\pi_i = k \mid x)$
Advantage/Disadvantage?
Advantage: each nucleotide is more likely to be called correctly
Disadvantage: the overall parse will be “choppy”, with CpG islands that are too short
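A minimal Viterbi sketch for the eight-state model follows. Several things here are assumptions and not part of the slides: the probabilities of staying in a region (`STAY`) are hypothetical values, the start distribution is uniform, and a switch between regions is modelled by multiplying the switching probability with the target region's transition row. Because emissions are 1/0, only the states x_i+ and x_i- are possible at position i, so the code tracks just the sign.

```python
import math

BASES = "ACGT"
TABLE = {
    "+": {"A": [0.180, 0.274, 0.426, 0.120], "C": [0.171, 0.368, 0.274, 0.188],
          "G": [0.161, 0.339, 0.375, 0.125], "T": [0.079, 0.355, 0.384, 0.182]},
    "-": {"A": [0.300, 0.205, 0.285, 0.210], "C": [0.322, 0.298, 0.078, 0.302],
          "G": [0.248, 0.246, 0.298, 0.208], "T": [0.177, 0.239, 0.292, 0.292]},
}
# Hypothetical probabilities of remaining in the same region at each step.
STAY = {"+": 0.99, "-": 0.999}
SIGNS = "+-"


def log_trans(s, u, t, v):
    """log a_kl for the move from state (u, s) to state (v, t)."""
    p_region = STAY[s] if s == t else 1.0 - STAY[s]
    return math.log(p_region * TABLE[t][u][BASES.index(v)])


def viterbi(x):
    """Return the most likely sign ('+' or '-') for every position of x."""
    score = {s: math.log(0.5) for s in SIGNS}  # uniform start (assumption)
    back = []
    for u, v in zip(x, x[1:]):
        new_score, pointers = {}, {}
        for t in SIGNS:
            best = max(SIGNS, key=lambda s: score[s] + log_trans(s, u, t, v))
            new_score[t] = score[best] + log_trans(best, u, t, v)
            pointers[t] = best
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    path = [max(SIGNS, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))
```

Calling `viterbi("AT" * 50 + "CG" * 50 + "AT" * 50)` labels the middle stretch with `+` and the flanks with `-`; the exact boundary positions depend on the assumed `STAY` values.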

10 What if a new genome comes?
We just sequenced the porcupine genome.
We know CpG islands play the same role in this genome.
However, we have no known CpG islands for porcupines.
We suspect the frequency and characteristics of CpG islands are quite different in porcupines.
How do we adjust the parameters in our model? LEARNING

11 Learning
Re-estimate the parameters of the model based on training data

12 Two learning scenarios
Estimation when the “right answer” is known
Examples:
  GIVEN: a genomic region $x = x_1 \ldots x_{1,000,000}$ where we have good (experimental) annotations of the CpG islands
  GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
Estimation when the “right answer” is unknown
Examples:
  GIVEN: the porcupine genome; we don’t know how frequent the CpG islands are there, nor do we know their composition
  GIVEN: 10,000 rolls of the casino player, but we don’t see when he changes dice
QUESTION: Update the parameters $\theta$ of the model to maximize $P(x \mid \theta)$

13 1. When the right answer is known
Given $x = x_1 \ldots x_N$ for which the true $\pi = \pi_1 \ldots \pi_N$ is known,
Define:
$A_{kl}$ = # times the $k \to l$ transition occurs in $\pi$
$E_k(b)$ = # times state $k$ in $\pi$ emits $b$ in $x$
We can show that the maximum likelihood parameters $\theta$ (those that maximize $P(x \mid \theta)$) are:
$$a_{kl} = \frac{A_{kl}}{\sum_i A_{ki}} \qquad e_k(b) = \frac{E_k(b)}{\sum_c E_k(c)}$$

14 1. When the right answer is known
Intuition: when we know the underlying states, the best estimate is the normalized frequency of the transitions and emissions that occur in the training data
Drawback: given little data, there may be overfitting:
  $P(x \mid \theta)$ is maximized, but $\theta$ is unreasonable
  0 probabilities – BAD
Example: given 10 casino rolls, we observe
  $x$ = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
  $\pi$ = F, F, F, F, F, F, F, F, F, F
Then:
  $a_{FF} = 1$; $a_{FL} = 0$
  $e_F(1) = e_F(3) = e_F(6) = .2$; $e_F(2) = .3$; $e_F(4) = 0$; $e_F(5) = .1$
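The counting estimate can be reproduced on the ten casino rolls with a short sketch. The function name `ml_estimate` and its signature are invented for this illustration. Counting the rolls directly gives 6 a frequency of 2/10, since it appears twice in x, and the six emission estimates for F then sum to one.

```python
def ml_estimate(x, pi, states, alphabet):
    """Maximum likelihood a_kl and e_k(b) from a sequence x with known path pi."""
    A = {k: {l: 0 for l in states} for k in states}
    E = {k: {b: 0 for b in alphabet} for k in states}
    for k, l in zip(pi, pi[1:]):      # count k -> l transitions
        A[k][l] += 1
    for k, b in zip(pi, x):           # count emissions of b from state k
        E[k][b] += 1

    def normalize(row):
        total = sum(row.values())
        # A state that was never observed has no defined estimate; leave zeros.
        return {key: (n / total if total else 0.0) for key, n in row.items()}

    a = {k: normalize(A[k]) for k in states}
    e = {k: normalize(E[k]) for k in states}
    return a, e


rolls = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]
path = ["F"] * 10
a, e = ml_estimate(rolls, path, states=["F", "L"], alphabet=range(1, 7))
print(a["F"])  # transition estimates out of F
print(e["F"])  # emission estimates for F
```

The zero for $e_F(4)$ and the empty row for L are exactly the overfitting problem described above: any later sequence containing a 4 or a visit to L would get probability zero.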

15 Pseudocounts
Solution for small training sets: add pseudocounts
$A_{kl}$ = # times the $k \to l$ transition occurs in $\pi$ + $r_{kl}$
$E_k(b)$ = # times state $k$ in $\pi$ emits $b$ in $x$ + $r_k(b)$
$r_{kl}$, $r_k(b)$ are pseudocounts representing our prior belief
Larger pseudocounts ⇒ strong prior belief
Small pseudocounts (< 1): just to avoid 0 probabilities
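As a small illustration of how pseudocounts change the estimate, the sketch below adds $r_k(b)$ to observed emission counts before normalizing. The counts are those of the rolls 2, 1, 5, 6, 1, 2, 3, 6, 2, 3, and the function name `smoothed_estimate` is made up for this example; the two pseudocount settings are arbitrary choices to contrast a weak and a strong prior.

```python
def smoothed_estimate(counts, pseudocounts):
    """Normalize (observed count + pseudocount) over all symbols."""
    total = sum(counts[b] + pseudocounts[b] for b in counts)
    return {b: (counts[b] + pseudocounts[b]) / total for b in counts}


# Observed emission counts for the fair die in ten rolls.
counts = {1: 2, 2: 3, 3: 2, 4: 0, 5: 1, 6: 2}

weak = smoothed_estimate(counts, {b: 1 for b in counts})     # weak prior
strong = smoothed_estimate(counts, {b: 20 for b in counts})  # strong prior

print(weak[4], strong[4])  # no longer zero: 1/16 and 20/130
print(weak[2], strong[2])  # 4/16 versus 23/130
```

With the strong prior every face stays close to 1/6 despite the small sample, while the weak prior mostly just removes the zero.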

16 Pseudocounts
Example: dishonest casino
We will observe the player for one day, 600 rolls
Reasonable pseudocounts:
  $r_{0F} = r_{0L} = r_{F0} = r_{L0} = 1$;
  $r_{FL} = r_{LF} = r_{FF} = r_{LL} = 1$;
  $r_F(1) = r_F(2) = \ldots = r_F(6) = 20$ (strong belief fair is fair)
  $r_L(1) = r_L(2) = \ldots = r_L(6) = 5$ (wait and see for loaded)
The above numbers are arbitrary – assigning priors is an art

17 2. When the right answer is unknown
We don’t know the true $A_{kl}$, $E_k(b)$
Idea:
  We estimate our “best guess” of what $A_{kl}$, $E_k(b)$ are (or we start with random / uniform values)
  We update the parameters of the model, based on our guess
  We repeat

18 2. When the right answer is unknown
Starting with our best guess of a model M with parameters $\theta$:
Given $x = x_1 \ldots x_N$ for which the true $\pi = \pi_1 \ldots \pi_N$ is unknown,
We can get to a provably more likely parameter set $\theta$, i.e., a $\theta$ that increases the probability $P(x \mid \theta)$
Principle: EXPECTATION MAXIMIZATION
1. Estimate $A_{kl}$, $E_k(b)$ in the training data
2. Update $\theta$ according to $A_{kl}$, $E_k(b)$
3. Repeat 1 & 2, until convergence

19 Estimating new parameters
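To estimate $A_{kl}$: (assume “$\mid \theta^{\text{CURRENT}}$” in all formulas below)
At each position $i$ of sequence $x$, find the probability that transition $k \to l$ is used:
$$P(\pi_i = k, \pi_{i+1} = l \mid x) = \frac{1}{P(x)} \, P(\pi_i = k, \pi_{i+1} = l, x_1 \ldots x_N) = \frac{Q}{P(x)}$$
where
$$\begin{aligned}
Q &= P(x_1 \ldots x_i, \pi_i = k, \pi_{i+1} = l, x_{i+1} \ldots x_N) \\
  &= P(\pi_{i+1} = l, x_{i+1} \ldots x_N \mid \pi_i = k) \, P(x_1 \ldots x_i, \pi_i = k) \\
  &= P(\pi_{i+1} = l, x_{i+1} x_{i+2} \ldots x_N \mid \pi_i = k) \, f_k(i) \\
  &= P(x_{i+2} \ldots x_N \mid \pi_{i+1} = l) \, P(x_{i+1} \mid \pi_{i+1} = l) \, P(\pi_{i+1} = l \mid \pi_i = k) \, f_k(i) \\
  &= b_l(i+1) \, e_l(x_{i+1}) \, a_{kl} \, f_k(i)
\end{aligned}$$
So:
$$P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i) \, a_{kl} \, e_l(x_{i+1}) \, b_l(i+1)}{P(x \mid \theta^{\text{CURRENT}})}$$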

20 Estimating new parameters
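So, $A_{kl}$ is the expected number of times transition $k \to l$ is used, given the current $\theta$:
$$A_{kl} = \sum_i P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \sum_i \frac{f_k(i) \, a_{kl} \, e_l(x_{i+1}) \, b_l(i+1)}{P(x \mid \theta)}$$
Similarly,
$$E_k(b) = \frac{1}{P(x \mid \theta)} \sum_{\{i \mid x_i = b\}} f_k(i) \, b_k(i)$$
[Figure: state $k$ at position $i$ and state $l$ at position $i+1$; $f_k(i)$ covers $x_1 \ldots x_i$, $a_{kl}$ and $e_l(x_{i+1})$ cover the step to $x_{i+1}$, and $b_l(i+1)$ covers $x_{i+2} \ldots x_N$]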
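The expected counts can be computed directly from the forward and backward tables. The sketch below is illustrative and makes several assumptions that are not in the slides: there is no explicit end state (so $b_k(N) = 1$ and $P(x) = \sum_k f_k(N)$), probabilities are not rescaled (fine only for short sequences), and the casino parameters `START`, `A0`, `E0` and the example `ROLLS` are made-up values.

```python
def forward(x, states, start, a, e):
    """f[i][k] = P(x_1..x_i, pi_i = k), with positions indexed from 0."""
    f = [{k: start[k] * e[k][x[0]] for k in states}]
    for sym in x[1:]:
        prev = f[-1]
        f.append({l: e[l][sym] * sum(prev[k] * a[k][l] for k in states)
                  for l in states})
    return f


def backward(x, states, a, e):
    """b[i][k] = P(x_{i+1}..x_N | pi_i = k), with b[N-1][k] = 1."""
    b = [{k: 1.0 for k in states}]
    for sym in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(a[k][l] * e[l][sym] * nxt[l] for l in states)
                     for k in states})
    return b


def expected_counts(x, states, alphabet, start, a, e):
    """Expected transition counts A_kl and emission counts E_k(c), plus P(x)."""
    f, b = forward(x, states, start, a, e), backward(x, states, a, e)
    px = sum(f[-1][k] for k in states)
    A = {k: {l: sum(f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l]
                    for i in range(len(x) - 1)) / px
             for l in states} for k in states}
    E = {k: {c: sum(f[i][k] * b[i][k] for i in range(len(x)) if x[i] == c) / px
             for c in alphabet} for k in states}
    return A, E, px


# Illustrative casino parameters (assumptions, not taken from the slides).
STATES, ALPHABET = ["F", "L"], [1, 2, 3, 4, 5, 6]
START = {"F": 0.5, "L": 0.5}
A0 = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
E0 = {"F": {c: 1 / 6 for c in ALPHABET},
      "L": {c: (0.5 if c == 6 else 0.1) for c in ALPHABET}}
ROLLS = [1, 6, 6, 6, 2, 3, 6, 6, 4, 5]

A, E, px = expected_counts(ROLLS, STATES, ALPHABET, START, A0, E0)
```

A quick sanity check on the result: the expected transition counts sum to N - 1 and the expected emission counts sum to N, because at every position the posterior probabilities sum to one.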

21 The Baum-Welch Algorithm
Initialization:
  Pick the best guess for the model parameters (or arbitrary values)
Iteration:
  Forward
  Backward
  Calculate $A_{kl}$, $E_k(b)$, given $\theta^{\text{CURRENT}}$
  Calculate the new model parameters $\theta^{\text{NEW}}$: $a_{kl}$, $e_k(b)$
  Calculate the new log-likelihood $\log P(x \mid \theta^{\text{NEW}})$
  GUARANTEED NOT TO DECREASE, BY EXPECTATION-MAXIMIZATION
Until $P(x \mid \theta)$ does not change much

22 The Baum-Welch Algorithm
Time complexity: # iterations × $O(K^2 N)$
Guaranteed to increase the log likelihood $\log P(x \mid \theta)$
Not guaranteed to find the globally best parameters
Converges to a local optimum, depending on initial conditions
Too many parameters / too large a model: overtraining
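The whole iteration can be sketched compactly in Python. This is an illustration under assumptions, not a reference implementation: there is no end state, the start distribution is held fixed, probabilities are unscaled (so it only suits short sequences), and the initial parameters and the example `ROLLS` are invented. Each step costs $O(K^2 N)$ because every position sums over all pairs of states.

```python
import math


def forward_backward(x, states, start, a, e):
    """Return forward table f, backward table b, and P(x)."""
    n = len(x)
    f = [dict() for _ in range(n)]
    b = [dict() for _ in range(n)]
    for k in states:
        f[0][k] = start[k] * e[k][x[0]]
        b[n - 1][k] = 1.0
    for i in range(1, n):
        for l in states:
            f[i][l] = e[l][x[i]] * sum(f[i - 1][k] * a[k][l] for k in states)
    for i in range(n - 2, -1, -1):
        for k in states:
            b[i][k] = sum(a[k][l] * e[l][x[i + 1]] * b[i + 1][l] for l in states)
    return f, b, sum(f[n - 1].values())


def baum_welch_step(x, states, alphabet, start, a, e):
    """One E-step plus M-step; also returns log P(x | current parameters)."""
    f, b, px = forward_backward(x, states, start, a, e)
    n = len(x)
    A = {k: {l: sum(f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l]
                    for i in range(n - 1)) / px for l in states} for k in states}
    E = {k: {c: sum(f[i][k] * b[i][k] for i in range(n) if x[i] == c) / px
             for c in alphabet} for k in states}
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_e = {k: {c: E[k][c] / sum(E[k].values()) for c in alphabet} for k in states}
    return new_a, new_e, math.log(px)


def baum_welch(x, states, alphabet, start, a, e, tol=1e-6, max_iter=200):
    """Iterate until the log-likelihood improves by less than tol."""
    history = []
    for _ in range(max_iter):
        a, e, loglik = baum_welch_step(x, states, alphabet, start, a, e)
        history.append(loglik)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
    return a, e, history


# Illustrative data and starting guess (assumptions, not from the slides).
STATES, ALPHABET = ["F", "L"], [1, 2, 3, 4, 5, 6]
START = {"F": 0.5, "L": 0.5}
A0 = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.2, "L": 0.8}}
E0 = {"F": {c: 1 / 6 for c in ALPHABET},
      "L": {c: (0.5 if c == 6 else 0.1) for c in ALPHABET}}
ROLLS = [3, 1, 5, 2, 4, 6, 6, 6, 1, 6, 6, 2, 6, 6, 3, 4, 1, 5, 2, 3,
         6, 6, 6, 5, 6, 6, 6, 1, 2, 4, 3, 5, 1, 2, 6, 4, 3, 1, 5, 2]

a_hat, e_hat, history = baum_welch(ROLLS, STATES, ALPHABET, START, A0, E0)
```

Here `history` holds the log-likelihood at the start of each iteration and should never decrease. Rerunning from a different `A0` and `E0` may end at different parameters, which is the local-optimum behaviour noted above.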

23 Alternative: Viterbi Training
Initialization: same
Iteration:
  Perform Viterbi, to find $\pi^*$
  Calculate $A_{kl}$, $E_k(b)$ according to $\pi^*$ + pseudocounts
  Calculate the new parameters $a_{kl}$, $e_k(b)$
Until convergence
Notes:
  Not guaranteed to increase $P(x \mid \theta)$
  Guaranteed to increase $P(x \mid \theta, \pi^*)$
  In general, worse performance than Baum-Welch

