The Model: To model the complex distribution of the data we used the Gaussian Mixture Model (GMM) with a countably infinite number of Gaussian components.


Predicting Novel Transcription Factor Binding Sites in Human Using a Machine Learning Approach

Sonya Liberman (1,2), Nir Friedman (1) and Hanah Margalit (2)
(1) School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel
(2) Department of Molecular Genetics and Biotechnology, Faculty of Medicine, The Hebrew University, Jerusalem, Israel

Transcription factors (TFs) regulate gene expression by binding to specific sequences on the DNA. A major challenge is to expand the known repertoire of TF-target pairs by identifying novel Transcription Factor Binding Sites (TFBSs) based on sequence data. One main difficulty in such computational predictions is the large number of false positives they generate. Here we examine the association of five features with TFBSs and show that they differ between true binding sites and similar sequences that are predicted as binding sites. Using machine learning approaches, we developed a computational scheme for TFBS prediction, in which sites predicted from sequence data are subjected to filtering and further classification according to these features. This results in a significant reduction in the number of false positive predictions and enables the construction of a more accurate transcription regulation network.

The Model
To model the complex distribution of the data we used the Gaussian Mixture Model (GMM) with a countably infinite number of Gaussian components.

Learning two separate models
The frequencies of the Gaussian components of the mixture and the parameters of each component were learned separately for the set of known sites predicted by TestMOTIF and for the set of new sites.

Classification
New sites predicted by TestMOTIF can be classified according to their probability of being generated by the first or the second set of parameters.

Training Sets
Each site was represented as a 5-coordinate vector.
A positive set was constructed from 159 known sites that were also discovered by TestMOTIF.
A negative set was constructed from 159 sites chosen at random from the set of new sites predicted by TestMOTIF.

Kernels
The sites were classified using four different kernels: Gaussian, linear, polynomial and sigmoidal.

Cross-Validation
A sevenfold cross-validation was performed to evaluate performance with each of the kernel functions. The linear kernel achieved the best cross-validation results.

Sevenfold cross-validation results for the linear kernel (average value, percentage):
False Positives: 0.64, 13.18%
True Negatives: -2.20, 86.82%
False Negatives: -0.92, 16.32%
True Positives: 5.98, 83.68%

Classification Results
A classifier was trained on the full set of 318 sites and correctly separated 88.68% of the training data.
All new sites predicted by TestMOTIF were scored by the classifier.
The threshold for defining true binding sites was set to a positive score of 2.
936 of 73,607 sites (~1.3%) received a score above the threshold (222 unique pairs of TF and target gene).
The final set included new target genes for 51 known transcription factors.

Average log probability of sites under each model:
Newly predicted sites, given a model built from known sites (159): 2.296
Newly predicted sites, given a model built from newly predicted sites (73,607): 17.631
Known sites, given a model built from newly predicted sites (73,607): -88.862
Known sites, given a model built from known sites (159): 2.623

References
1. Sinha, S., M. Blanchette, and M. Tompa, PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics, 2004. 5: p. 170.
2. Neal, R.M., Regression and classification using Gaussian process priors. Oxford University Press, 1998: p. 475-501.
3. Siepel, A., et al., Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res, 2005. 15(8): p. 1034-50.
4. Li, N. and M. Tompa, Analysis of computational approaches for motif discovery. Algorithms Mol Biol, 2006. 1: p. 8.
5. Jensen, S.T., X.S. Liu, Q. Zhou, and J.S. Liu, Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective. Statistical Science, 2004. 19(1): p. 188-204.
6. Barash, Y., G. Elidan, T. Kaplan, and N. Friedman, CIS: compound importance sampling method for protein-DNA binding site p-value estimation. Bioinformatics, 2005. 21(5): p. 596-600.

Acknowledgments
Dr. Yael Altuvia for her help with the feature definition.
Tommy Kaplan for his help with the TestMOTIF tool.

Predicting new sites
Transcription is governed by cis-regulatory elements and associated transcription factors.
To predict new TFBSs we use motifs of known TFBSs represented by PSSMs.
We use a motif search tool (TestMOTIF [6]) that predicts new TFBSs in promoter sequences according to known motifs and assigns a p-value to each prediction.
[Figure: motif matches (AATGATGC/TTACTACG and the reverse complement GCATCATT/CGTAGTAA) in the promoter of Gene 2.]

Known human TFBSs from the TRANSFAC database were mapped onto the human genome.
210 sites were chosen as a reliable set of known TFBSs.
Promoters of 150 genes were searched for putative binding sites for 98 different TFs.
We predicted ~150,000 statistically significant new sites, including 174 of the 210 known TFBSs (~83%).
[Figure: overlap of predicted sites and known TFBSs, labeled True Positives, False Positives and False Negatives; 83% of known TFBSs are recovered.]

Features
True positive predictions are differentiated from false positive predictions based on the following five features:
Evolutionary conservation
Number of neighboring known binding sites of other TFs
Number of neighboring sites with a similar sequence
Distance from the TSS (Transcription Start Site) of the target gene
Orientation of the transcription factor binding
[Figure: each site is described by a vector (X1, X2, X3, X4, X5) built from conservation, distance from TSS, number of neighboring binding sites, orientation of transcription factor binding, and number of neighboring sites that fit the motif.]

Evolutionary conservation
Known TFBSs are on average more conserved than other predicted sites.
Average conservation score:
Sites predicted by a motif search tool: 0.24
Known transcription factor binding sites: 0.57

Number of neighboring known binding sites of other transcription factors
Sites less than 200 bp apart are considered neighbors.
Known TFBSs have on average more neighbors among known TFBSs than other predicted sites do.
Average number of neighboring known TFBSs:
Sites predicted by a motif search tool: 0.44
Known transcription factor binding sites: 1.04
[Figure: clustered versus scattered TFBSs in the promoters of Gene 1 and Gene 2; known and predicted TFBSs are marked, and different shapes indicate binding sites for different TFs.]

Number of neighboring sites with a similar sequence
Known TFBSs tend to be surrounded by other sites that match their motif.
Average number of sites with a similar sequence:
Promoter without a known TFBS: 10.52
Promoter with a known TFBS: 14.65
[Figure: a promoter with a known site (Gene 1) and a promoter without a known site (Gene 2); known and predicted TFBSs are marked.]

Distance from the TSS of the target gene
61% of sites are located within the 200 bp upstream of the TSS, and 75% are located within the 400 bp upstream of the TSS.
[Figure: distribution of the distance of sites from their target genes; x-axis: position relative to TSS, y-axis: number of real sites.]

Orientation of the transcription factor binding
Only a few TFs have a specific binding orientation, for example:
E2F has a defined orientation of upstream binding sites (90% have the same orientation).
EBOX has a defined orientation of downstream binding sites (86%).
Unfortunately, only a few transcription factors have enough known binding sites to enable reliable statistics.
[Figure: binding-site orientation relative to the target gene, illustrated with the motif AACCCA/TTGGGT and its reverse complement TGGGTT/ACCCAA upstream of Gene 1 and Gene 3.]
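
The two neighbor-based features count sites that lie less than 200 bp from a given site. A minimal sketch of that count in Python follows, assuming each site is reduced to a single integer coordinate relative to the TSS and that distance is the absolute difference between coordinates. The function name `count_neighbors` and the example coordinates are hypothetical and are not taken from the poster.

```python
from bisect import bisect_left, bisect_right


def count_neighbors(site_pos, sorted_positions, max_dist=200):
    """Count positions strictly closer than max_dist to site_pos.

    sorted_positions must be sorted in ascending order and must not
    contain the query site itself (hypothetical input convention).
    """
    # First index whose position is greater than site_pos - max_dist.
    lo = bisect_right(sorted_positions, site_pos - max_dist)
    # First index whose position is at least site_pos + max_dist.
    hi = bisect_left(sorted_positions, site_pos + max_dist)
    return hi - lo


# Made-up coordinates (bp relative to the TSS) of known sites of other TFs.
known_other_tf_sites = [-950, -310, -180, -120, 40]

# Neighbors of a predicted site at -150: the sites at -310, -180, -120 and 40.
print(count_neighbors(-150, known_other_tf_sites))
```

The same function could serve both features by changing which list of sites is passed in: known sites of other TFs for one, predicted sites matching the same motif for the other.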

