Ab initio motif finding


Presentation on theme: "Ab initio motif finding"— Presentation transcript:

1

2 Ab initio motif finding
Ryo Shimizu

3 Agenda Background / motivation Paper 1 Paper 2 Conclusion

4 Central Dogma
DNA {A,C,G,T} → (transcription) → mRNA {A,C,G,U} → (translation) → amino acid sequence → (protein folding) → protein
The whole process of transcribing DNA into mRNA and then translating mRNA into an amino acid sequence is known as gene expression. Gene expression can be regulated.
Image credits: DNA, mRNA, Protein, Amino acid

5 Impacts of gene regulation
Functioning of an organism: the regulation of gene expression by nutrients plays an important role in the overall manifestations of nutritional deficiencies.
Development of an organism: the fruit fly Drosophila melanogaster, for example, completes its entire embryonic development, from fertilized egg to crawling larva with muscles, nervous system, digestive tract and body wall, in only 24 hours. This remarkably fast and accurate process is predominantly controlled by the correct spatial and temporal regulation of genes.
Evolution of organisms: researchers argue that the vast differences between humans and chimpanzees are due more to changes in gene regulation than to differences in the individual genes themselves.

6 Transcription Process in which mRNA is made using DNA as a template
Only genes are transcribed
Regulated by transcription factors
A transcription factor is a protein that acts as a regulator of gene expression, specifically regulating the activation of transcription

7 Transcription movie

8 Binding Site Region on a protein, DNA, or RNA to which ligands attach
Ligands: other molecules and ions
Figure 1 (labels: dGTP, hydrogen bonds, binding site): Binding site for the incoming 5' triphosphate nucleotide in DNA polymerase. The incoming dGTP (purple) is stacked between tyrosine residue 526 and the 3' adenine of the daughter DNA strand. Positioning of the dGTP is promoted by hydrogen bonding (green dotted lines) between protein side chains and DNA phosphate groups.

9 Motif
Common sequence “pattern” in the binding sites of a transcription factor
A succinct way of capturing variability among the binding sites
The figure to the right is an example motif. Bicoid is an important control factor during early Drosophila embryogenesis.

10 Motif representation
Consensus sequence: XTCATCAX
May allow “degenerate” symbols in the string, e.g., N = A/C/G/T; W = A/T; S = C/G; R = A/G; Y = T/C.
Position-Specific Scoring Matrix (PSSM): a more powerful representation; gives a probabilistic treatment to the nucleotides
[Figure: PSSM graph]
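A minimal sketch of the two representations, assuming the PSSM is stored as one probability column per site position and is estimated from aligned sites with a pseudocount. The example sites, the function names and the pseudocount value are made up for illustration and do not come from the slides.

```python
from collections import Counter

ALPHABET = "ACGT"

def build_pssm(sites, pseudocount=1.0):
    """Estimate one probability column per position from equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pssm = []
    for column in zip(*sites):
        counts = Counter(column)
        pssm.append({n: (counts[n] + pseudocount) / total for n in ALPHABET})
    return pssm

def consensus(pssm):
    """Most probable nucleotide at each position (no degenerate symbols)."""
    return "".join(max(ALPHABET, key=lambda n: col[n]) for col in pssm)

def site_probability(pssm, kmer):
    """Probability of a k-mer, treating positions as independent."""
    p = 1.0
    for col, nucleotide in zip(pssm, kmer):
        p *= col[nucleotide]
    return p

# Made-up aligned binding sites, for illustration only.
sites = ["TCATCA", "TCATCA", "TCTTCA", "TCATGA"]
pssm = build_pssm(sites)
```

The consensus string keeps only the winner of each column, while the PSSM also keeps how strongly each position prefers it, which is what makes it the more expressive representation.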

11 Ab initio Motif finding
Say a transcription factor (TF) controls five different genes.
Each of the five genes will have binding sites for the TF in its promoter region.
[Figure: Gene 1 to Gene 5, each with binding sites for the TF]

12 Ab initio Motif finding
GIVEN: promoter regions of the five genes G1,…,G5
FIND: binding sites of the TF, without prior knowledge
The important thing to note is that although the binding sites are similar to each other, they are not necessarily identical.
A related problem is to find a motif that represents the binding sites of an unknown TF.

13 Agenda Background / motivation Paper 1 Paper 2 Conclusion

14 Paper 1 Ab Initio Prediction of Transcription Factor Targets Using Structural Knowledge
Tommy Kaplan, Nir Friedman, Hanah Margalit

15 Overview
Known binding sites of other proteins in the same protein family → prediction → identify binding sites of new proteins (of that family)
Same family → same binding specificity of residues
The idea here is to gather statistics about which amino acids of the transcription factor (protein) interact with which nucleotides in the DNA sequence, and to use them to make a prediction about a new transcription factor (protein).
Although they share the same structure, different proteins from a structural family have different binding specificities because of the presence of different residues at the DNA-binding positions. To predict their binding site motifs, the residues at these positions need to be identified.

16 Application: Cys2His2 Zinc Finger
Largest known DNA-binding family in multicellular organisms
Extensively studied
Has a stringent binding model

17 Curation of Zinc Finger sequences and their binding sites
Canonical is an adjective derived from canon. Canon essentially means "rule", "law", "standard", and has come to mean "generally accepted" or "authoritatively correct."
Start from 31 experimentally determined canonical domains.
Train a profile HMM on them to classify the remaining Cys2His2 zinc finger domains in TRANSFAC as canonical or non-canonical.
From these, select proteins with two to four properly spaced canonical fingers.
Result: 61 canonical Cys2His2 zinc finger proteins, 455 protein-binding site pairs.
Pipeline: Cys2His2 domains in the TRANSFAC database (input) → profile HMM (trained on the 31 canonical domains) → classified as canonical / non-canonical → curate → 61 canonical finger proteins, 455 protein-binding site pairs (output)

18 Identification of DNA-binding residues
In order to estimate the context-specific DNA-recognition preferences of the Cys2His2 zinc finger DNA-binding family, they used the canonical binding model learned from the solved protein–DNA complex of Egr-1.
According to this model, the binding specificity of each zinc finger domain is determined by residues at four key positions (see Figure 1): the interacting residues are located at positions 6, 3, 2 and -1 relative to the beginning of the alpha-helix.
Identify these positions using their relative positioning in the Cys2His2 pattern (PROSITE motif pattern): CX(2–4)CX(11–13)HX(3–5)H
Idea: learn a different set of DNA-recognition preferences for each of the four key positions. Each set should express the probability of every amino acid interacting with each nucleotide.
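The pattern above translates directly into a regular expression. The sketch below locates fingers and reads off the four key residues, assuming that the first histidine of the pattern sits at helix position 7, so that positions 6, 3, 2 and -1 lie 1, 4, 5 and 7 residues before it. That offset table, the function name and the example sequence are illustrative assumptions, not something stated on the slide.

```python
import re

# CX(2-4)CX(11-13)HX(3-5)H; the group captures the stretch between the
# second Cys and the first His, so its end is the index of the first His.
ZF_PATTERN = re.compile(r"C.{2,4}C(.{11,13})H.{3,5}H")

# Assumption: the first His is helix position 7 and there is no position 0,
# so helix position p maps to these offsets from the first His.
HELIX_OFFSETS = {-1: -7, 2: -5, 3: -4, 6: -1}

def key_residues(protein):
    """Return one {helix position: amino acid} dict per matched finger."""
    fingers = []
    for match in ZF_PATTERN.finditer(protein):
        first_his = match.end(1)
        fingers.append(
            {pos: protein[first_his + off] for pos, off in HELIX_OFFSETS.items()}
        )
    return fingers

# Illustrative finger-like sequence.
example = key_residues("CPVESCDRRFSRSDELTRHIRIH")
```

A plain regular expression can also hit spurious matches, which is one reason the curation step uses a trained profile HMM rather than the pattern alone.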

19 Estimating DNA Recognition Preferences
INPUT: a set of pairs of transcription factors and their target DNA sequences.
[Figure: each TF paired with its target DNA sequence]

20 Probabilistic model of binding preferences
A → set of interacting residues in the 4 positions p of the k fingers, e.g. A1,2 → interacting residue of finger 1 at position 2
N1,…,NL → target DNA sequence
Pp(N|A) → conditional probability of nucleotide N given amino acid A at position p
Conditional probability of interaction with the DNA subsequence starting from the jth position in the DNA
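A minimal sketch of how such a model could score a candidate binding location, assuming each finger contacts one 3-nucleotide triplet and that positions 6, 3 and -1 contact the first, second and third nucleotide of that triplet. The contact map, the finger ordering along the site, the omission of position 2 and all probability values are simplifying assumptions made for illustration; they are not taken from the paper.

```python
# Hypothetical contact map: offset inside a finger's triplet for each
# helix position. Position 2 is left out in this simplified sketch.
CONTACT_OFFSET = {6: 0, 3: 1, -1: 2}

def site_likelihood(fingers, prefs, dna, start):
    """Product of Pp(N|A) over all contacts for a site starting at `start`.

    fingers: list of {helix position: amino acid}, in site order (assumption).
    prefs:   prefs[position][amino acid][nucleotide] = probability.
    """
    p = 1.0
    for i, residues in enumerate(fingers):
        for pos, offset in CONTACT_OFFSET.items():
            nucleotide = dna[start + 3 * i + offset]
            p *= prefs[pos][residues[pos]][nucleotide]
    return p

def binding_posterior(fingers, prefs, dna):
    """Posterior over start positions, assuming a uniform prior."""
    site_len = 3 * len(fingers)
    scores = [site_likelihood(fingers, prefs, dna, j)
              for j in range(len(dna) - site_len + 1)]
    total = sum(scores)
    return [s / total for s in scores]

# Made-up preferences for a single finger, for illustration only.
prefs = {
    6: {"R": {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}},
    3: {"E": {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}},
    -1: {"R": {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}},
}
fingers = [{6: "R", 3: "E", -1: "R"}]
posterior = binding_posterior(fingers, prefs, "AAGCGAA")
```

With these toy numbers the posterior concentrates on the start position of the GCG triplet, which is the kind of distribution over binding locations that the estimation step works with.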

21 Where did the P2 term go?
The amino acid at position 2 interacts with the nucleotide that is complementary to the nucleotide interacting with position 6 of the previous finger.

22 Estimating DNA Recognition Preferences
Apply Expectation Maximization (EM):
Identify binding locations
Identify the residues and nucleotides that participate in protein-DNA interaction
Collect statistics about the DNA-binding preferences of residues under different contexts of the binding domain
Optimize recognition preferences
The EM algorithm is used to simultaneously assess the exact binding positions of each protein–DNA pair and to estimate four sets of position-specific DNA-recognition preferences.

23 Expectation Maximization algorithm
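Start from an initial guess of the DNA recognition preferences.
E-step: compute the expected posterior probability of the binding locations for all protein–DNA pairs.
M-step: update the recognition preferences to maximize the likelihood of the binding positions for all protein–DNA pairs, based on the distribution of possible binding locations.
Repeat the two steps until convergence.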
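The loop can be sketched on a deliberately simpler model than the one in the paper: a single PSSM shared by all sequences, exactly one site per sequence, and a uniform background. This swaps the residue-conditioned preferences for plain column probabilities, so it only illustrates the E-step / M-step structure. The function name, parameters and input sequences are hypothetical.

```python
import random

ALPHABET = "ACGT"

def em_motif(seqs, width, iterations=50, pseudocount=0.1, seed=0):
    """EM for a PSSM with one site per sequence (simplified stand-in model)."""
    rng = random.Random(seed)
    # Initial guess: random, strictly positive column distributions.
    pssm = []
    for _ in range(width):
        raw = [rng.random() + 0.5 for _ in ALPHABET]
        total = sum(raw)
        pssm.append({n: r / total for n, r in zip(ALPHABET, raw)})

    posteriors = []
    for _ in range(iterations):
        # E-step: posterior over binding locations in every sequence.
        posteriors = []
        for seq in seqs:
            scores = []
            for j in range(len(seq) - width + 1):
                p = 1.0
                for col, nucleotide in zip(pssm, seq[j:j + width]):
                    p *= col[nucleotide]
                scores.append(p)
            total = sum(scores)
            posteriors.append([s / total for s in scores])

        # M-step: re-estimate the columns from posterior-weighted counts.
        counts = [{n: pseudocount for n in ALPHABET} for _ in range(width)]
        for seq, post in zip(seqs, posteriors):
            for j, weight in enumerate(post):
                for k in range(width):
                    counts[k][seq[j + k]] += weight
        pssm = [{n: c[n] / sum(c.values()) for n in ALPHABET} for c in counts]

    # The returned posteriors come from the last E-step, i.e. they were
    # computed with the PSSM from before the final M-step.
    return pssm, posteriors

# Made-up input sequences, for illustration only.
pssm, posteriors = em_motif(["ACGTTGCATG", "TTGCAACGTA", "GGTTGCAAAT"], width=5)
```

Like any EM procedure, this converges to a local optimum that depends on the initial guess, so in practice several restarts are usually compared.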

24 Probabilistic model output
Logo legend: total height = information content; relative height = probability; color intensity = confidence
(Phenylalanine, Cytosine) → prevalent in position 2
(Lysine, Guanine) → irrespective of position
Some of the DNA-binding preferences are general, regardless of the residue’s position within the zinc finger (e.g., lysine’s tendency to bind guanine), while others are position-dependent (e.g., the tendency of phenylalanine to bind cytosine only when in position 2).

25 Ab initio genome wide prediction
Use it on the Drosophila melanogaster genome!
Scan 16,201 putative gene products
Identified 29 canonical Cys2His2 zinc finger transcription factors
Use the sequence and the estimated DNA recognition preferences to compile a binding site model for each transcription factor
Use these binding site models to scan the upstream promoter regions of 15,665 D. melanogaster genes
Pipeline: 16,201 putative gene products → 29 canonical fingers + DNA recognition preferences → binding site models → scan the promoter regions

26 Results: D. melanogaster
Enriched GO terms match prior biological knowledge!
(1) Putative targets of Glass were found to be enriched with terms related to photoreceptor cell development, consistent with previous studies that linked the Glass transcription factor with eye photoreceptor development.
(2) Putative targets of Buttonhead (Btd) and Sp1 were enriched with development terms such as neurogenesis and organogenesis.
[Figure: transcription factors vs. GO terms; blue cells indicate significant enrichment of GO terms]

27 Results: D. melanogaster
Based on the expression data, the putative regulators may have an active role in early developmental stages.
At least one significant embryogenesis experiment
[Figure: transcription factors vs. embryogenesis phase]

28 Agenda Background / motivation Paper 1 Paper 2 Conclusion

29 Paper 2 MotifCut: regulatory motif finding with maximum density subgraphs
Eugene Fratkin, Brian Naughton, Douglas Brutlag, Serafim Batzoglou

30 Drawbacks of existing methods
As an optimization problem, motif finding is intractable
Existing methods rely on EM or local heuristic search

31 Drawbacks of existing methods
Choice of motif model
Simple models like the PSSM impose biologically unrealistic assumptions
Independence assumption: biologically unrealistic
More involved models are harder to parametrize and train
Example: a perfectly conserved nucleotide dependency (ATG and CAT); the resulting PSSM is WRONG!

32 Overview
Graph-based approach
Nodes: k-mers of the input sequence
Edges: pairwise k-mer similarity
Motif search → maximum density subgraph
In this graph, we search for a motif as the maximum density subgraph, which is a set of k-mers that exhibit a large number of pairwise similarities.
Convex optimization problem
Polynomial time solution
No strong assumptions regarding the structure of the motif
Minimal training requirement
Motif detection scales well with increasing input size
Finds novel motifs
Fig. 1. Two yeast motifs: with and without nucleotide dependencies. This diagram shows two graphs of real yeast motifs, ADR1 and YAP6. Each node corresponds to a motif occurrence. Edges connect pairs of k-mers that are identical or differ by one mutation. If we model a motif with a PSSM, we can compute the probability of a specific k-mer being generated by that PSSM. Given the number of motif instances, we can convert these probabilities into the expected number of occurrences for each k-mer. This number can be compared with the actual number of occurrences. In the two graphs, k-mers that occur less frequently than expected are colored red, k-mers that occur more frequently are colored green, and cases in which the observed and expected numbers are equal are colored blue. In such a graphical representation, PSSM-generated motifs have a single dense center, corresponding to the maximum likelihood k-mers, and the density of k-mers decreases as they get further from that center and hence less likely. The PSSM model is a good fit for the YAP6 motif, but not for the ADR1 motif.

33 MotifCut Algorithm
Convert the sequence into a collection of k-mers
Each overlap/duplicate is considered distinct → one k-mer for each nucleotide position in the input sequence
The k-mers form the set of vertices V in G=(V,E)
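A minimal sketch of this vertex-construction step, assuming each vertex is stored as a (sequence index, position, k-mer) tuple so that overlapping and duplicate k-mers stay distinct. The tuple layout and function name are illustrative choices. Only positions that leave room for a full k-mer produce a vertex, so a sequence of length n contributes n - k + 1 vertices.

```python
def kmer_vertices(sequences, k):
    """One vertex per k-mer occurrence; duplicates and overlaps are kept."""
    vertices = []
    for seq_id, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            vertices.append((seq_id, pos, seq[pos:pos + k]))
    return vertices

# Made-up input, for illustration only.
vertices = kmer_vertices(["ACGTA", "ACG"], k=3)
```

Here "ACG" appears in both sequences and yields two separate vertices, which is what lets repeated occurrences reinforce each other in the graph.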

34 MotifCut Algorithm
For every pair of vertices (vi, vj), create an edge with weight wij
wij = f(# mismatches between the k-mers in vi, vj)
f is normalized with respect to the nucleotide background distribution, so that more similar k-mers are connected with higher-weight edges. The background distribution is used to find the probability of the two k-mers appearing at random given the input. Therefore, the weight of an edge connecting a pair of k-mers that are unlikely to appear in the background is up-weighted.
Let M denote the collection of k-mer occurrences corresponding to the binding sites of a specific transcription factor, and let B denote the background k-mers.
M → k-mers of binding site
B → background k-mers
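A minimal sketch of the edge construction, with a stand-in for f: the weight decays exponentially with the number of mismatches and is cut to zero beyond a threshold. This stand-in, the threshold and the function names are assumptions made for illustration. The actual weight function in MotifCut is also normalized against the background distribution, which is omitted here.

```python
import math
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))

def edge_weight(kmer_a, kmer_b, max_mismatches=2):
    """Hypothetical f: higher weight for more similar k-mers."""
    mismatches = hamming(kmer_a, kmer_b)
    return math.exp(-mismatches) if mismatches <= max_mismatches else 0.0

def build_graph(kmers, max_mismatches=2):
    """Weighted edges {(i, j): w} over all vertex pairs with non-zero weight."""
    edges = {}
    for i, j in combinations(range(len(kmers)), 2):
        w = edge_weight(kmers[i], kmers[j], max_mismatches)
        if w > 0:
            edges[(i, j)] = w
    return edges

# Made-up k-mers, for illustration only.
edges = build_graph(["ACGT", "ACGA", "TTTT"])
```

Comparing all pairs is quadratic in the number of k-mers, which is acceptable for a sketch but is one reason the later slides restrict the search to neighborhoods.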

35 Resulting graph Note: should be maximally connected!

36 MotifCut Algorithm
Find the maximum density subgraph
Parametric flow algorithm (Gallo et al., 1989)
A type of fractional programming
Iteratively apply push/relabel to find the max-flow and min-cut
O(VE log(V²/E)) → too slow!
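To make the objective concrete, the sketch below uses greedy peeling, a simple approximation that is not the parametric flow algorithm named on the slide: it repeatedly removes the vertex with the smallest weighted degree and keeps the densest intermediate subgraph. Density is taken here as total edge weight divided by the number of vertices. The function name and the example graph are illustrative.

```python
def greedy_densest_subgraph(num_vertices, edges):
    """Greedy peeling for the densest subgraph (approximation, not exact).

    edges: {(i, j): weight} over vertices 0..num_vertices-1 (at least one vertex).
    Returns (vertex set, density) with density = edge weight / vertex count.
    """
    neighbors = {v: {} for v in range(num_vertices)}
    for (i, j), w in edges.items():
        neighbors[i][j] = w
        neighbors[j][i] = w
    # Weighted degree of each vertex within the remaining subgraph.
    degree = {v: sum(nb.values()) for v, nb in neighbors.items()}

    remaining = set(range(num_vertices))
    total_weight = sum(edges.values())
    best_set, best_density = set(remaining), total_weight / num_vertices

    while len(remaining) > 1:
        v = min(remaining, key=lambda u: degree[u])
        remaining.remove(v)
        total_weight -= degree[v]
        for u, w in neighbors[v].items():
            if u in remaining:
                degree[u] -= w
        density = total_weight / len(remaining)
        if density > best_density:
            best_set, best_density = set(remaining), density
    return best_set, best_density

# Made-up graph: a unit-weight triangle, a weakly attached vertex and an
# isolated vertex.
toy_edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0, (0, 3): 0.1}
dense_set, dense_value = greedy_densest_subgraph(5, toy_edges)
```

On this toy graph the peeling discards the isolated and weakly attached vertices first and returns the triangle, which is the behavior a motif cluster is expected to show against background k-mers.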

37 MDS optimization
Pick a center of a neighborhood
Discard edges with weight <= w
Re-introduce all edges in the neighborhood
Run MDS in the neighborhood
Repeat for every vertex

38 Results
Synthetic data
Yeast data
vs MEME (Bailey et al., 1995)
vs AlignACE (Hughes et al., 2000)
vs BioProspector (Liu et al., 2001)

39 Synthetic benchmark results
In these graphs the X-axis represents the input size, in nucleotides, and the Y-axis represents the percentage of motifs correctly identified. A motif is considered correctly identified if its Pearson correlation with the seeded motif is 0.7 or greater.
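The correctness criterion can be sketched as follows, assuming both motifs are PSSMs of the same width that are already aligned, and that the correlation is computed over their flattened probability entries. How the benchmark aligns the motifs and flattens them is not stated on the slide, so those choices and the function names are assumptions.

```python
import math

ALPHABET = "ACGT"

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def pssm_correlation(pssm_a, pssm_b):
    """Correlation over the flattened entries of two aligned PSSMs."""
    flat_a = [col[n] for col in pssm_a for n in ALPHABET]
    flat_b = [col[n] for col in pssm_b for n in ALPHABET]
    return pearson(flat_a, flat_b)

def correctly_identified(found, seeded, threshold=0.7):
    """Apply the 0.7 threshold from the slide."""
    return pssm_correlation(found, seeded) >= threshold

# Made-up PSSMs, for illustration only.
seeded = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
          {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
unrelated = [{"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
             {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
```

A PSSM whose entries are all equal has zero variance, so the correlation is undefined for it and this sketch would raise a division error.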

40 Results – Running time and yeast data

41 Agenda Background / motivation Paper 1 Paper 2 Conclusion

42 Conclusion Ab initio motif finding Use of structural knowledge
Graph representation of motifs

