The Model: To model the complex distribution of the data, we used a Gaussian Mixture Model (GMM) with a countably infinite number of Gaussian components.

Presentation transcript:

The Model
To model the complex distribution of the data, we used a Gaussian Mixture Model (GMM) with a countably infinite number of Gaussian components.

Learning two separate models
The frequencies of the Gaussian components of the mixture and the parameters of each component were learned separately for the set of known sites predicted by TestMOTIF and for the set of new sites [2].

Classification
New sites predicted by TestMOTIF can be classified according to their probability of being generated by the first or the second set of parameters.

Training Sets
Each site was represented as a 5-coordinate vector.
A positive set was constructed from 159 known sites that were also discovered by TestMOTIF.
A negative set was constructed from 159 sites chosen at random from the set of new sites predicted by TestMOTIF.

Kernels
The sites were classified using four different kernels: Gaussian, linear, polynomial and sigmoidal.

Cross-Validation
A sevenfold cross-validation was performed to evaluate performance with each of the kernel functions.
The linear kernel achieved the best cross-validation results.
[Table: sevenfold cross-validation results for the linear kernel. Columns: average value and percentage of false positives, true negatives, false negatives and true positives. The values are not preserved in this transcript.]

Classification Results
A classifier was trained on the full set of 318 sites and correctly separated 88.68% of the training data.
All new sites predicted by TestMOTIF were scored by the classifier.
The threshold for defining true binding sites was set to a positive score of 2.
936 of 73607 sites (~1.3%) received a score above the threshold (222 unique pairs of TF and target gene).
The final set included new target genes for 51 known transcription factors.

Predicting Novel Transcription Factor Binding Sites in Human Using a Machine Learning Approach
Sonya Liberman 1,2, Nir Friedman 1 & Hanah Margalit 2
1 School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel
2 Department of Molecular Genetics and Biotechnology, Faculty of Medicine, The Hebrew University, Jerusalem, Israel

[Table: average log probability of newly predicted sites given a model built from known sites (159); of newly predicted sites given a model built from newly predicted sites (73607); of known sites given a model built from newly predicted sites (73607); of known sites given a model built from known sites (159). The values are not preserved in this transcript.]

References
1. Sinha, S., Blanchette, M., and Tompa, M. PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics.
2. Neal, R.M. Regression and classification using Gaussian process priors. Oxford University Press, 1998.
3. Siepel, A., et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res, (8).
4. Li, N. and Tompa, M. Analysis of computational approaches for motif discovery. Algorithms Mol Biol.
5. Jensen, S.T., Liu, X.S., Zhou, Q., and Liu, J.S. Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective. Statistical Science, (1).
6. Barash, Y., Elidan, G., Kaplan, T., and Friedman, N. CIS: compound importance sampling method for protein-DNA binding site p-value estimation. Bioinformatics, 21(5).

Transcription factors (TFs) regulate gene expression by binding to specific sequences on the DNA. A major challenge is to expand the known repertoire of TF-target pairs by identifying novel Transcription Factor Binding Sites (TFBSs) based on sequence data. One main difficulty in such computational predictions is the large number of false positives they generate. Here we examine the association of five features with TFBSs and show that they differ between true binding sites and similar sequences that are predicted as binding sites.
Using machine learning approaches, we developed a computational scheme for TFBS prediction, in which sites predicted from sequence data are filtered and then classified according to these features. This results in a significant reduction in the number of false positive predictions and enables the construction of a more accurate transcription regulation network.

Known human TFBSs from the TRANSFAC database were mapped onto the human genome.
210 sites were chosen as a reliable set of known TFBSs.
Promoters of 150 genes were searched for putative binding sites for 98 different TFs.
We predicted ~150,000 statistically significant new sites, including 174 of the 210 known TFBSs (~83%).
[Figure: overlap between predicted sites and known TFBSs. Labels: True Positives (83%), False Positives, False Negatives, Known TFBSs.]
The aim is to differentiate between true positive predictions and false positive predictions.

Evolutionary Conservation

Number of neighboring sites with a similar sequence
Number of neighboring known binding sites of other transcription factors
Sites less than 200 bp apart are considered neighbors.
[Figure: Gene 1 with clustered TFBSs and Gene 2 with scattered TFBSs. Known and predicted TFBSs are marked; different shapes indicate binding sites for different TFs.]
[Figure: a promoter with a known site (Gene 1) and a promoter without a known site (Gene 2). Known and predicted TFBSs are marked.]

Distance from the TSS (Transcription Start Site) of the target gene
[Figure: distribution of the distance of sites from their target genes. Axes: position relative to TSS, number of real sites.]
61% of sites are located within 200 bp upstream of the TSS, and 75% are located within 400 bp upstream of the TSS.

Only a few TFs have a specific binding orientation. For example, E2F has a defined orientation of upstream binding sites (90% have the same orientation), and EBOX has a defined orientation of downstream binding sites (86%). Unfortunately, only a few transcription factors have enough known binding sites to enable reliable statistics.
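The two neighbor-count features rest on the 200 bp rule stated above. The following is a minimal sketch of that rule, not the authors' implementation: it assumes sites are given as integer coordinates on the same promoter, and the function name `count_neighbors` and the example coordinates are hypothetical.

```python
from bisect import bisect_left, bisect_right

# From the poster: sites less than 200 bp apart are considered neighbors.
NEIGHBOR_DISTANCE_BP = 200


def count_neighbors(position, other_positions, max_distance=NEIGHBOR_DISTANCE_BP):
    """Count sites strictly closer than max_distance to position.

    other_positions should not include the query site itself, because a
    site at the same coordinate would be counted as a neighbor.
    """
    ordered = sorted(other_positions)
    # Keep coordinates p with position - max_distance < p < position + max_distance.
    lo = bisect_right(ordered, position - max_distance)
    hi = bisect_left(ordered, position + max_distance)
    return hi - lo


# Hypothetical coordinates: 800 and 1200 are exactly 200 bp away, so they are excluded.
example_count = count_neighbors(1000, [800, 801, 950, 1199, 1200, 1500])
print(example_count)  # 3
```

Calling the same function once with the known binding sites of other TFs and once with the predicted sites that match the query motif would give the two neighbor counts, under the assumptions above.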
Orientation of the transcription factor binding
[Figure: the same double-stranded site drawn in opposite orientations (AACCCA/TTGGGT and TGGGTT/ACCCAA) upstream of Gene 1 and Gene 3, and in the same orientation upstream of both genes.]

Transcription is governed by cis-regulatory elements and associated transcription factors.
To predict new TFBSs, we use motifs of known TFBSs represented by PSSMs.
We use a motif search tool (TestMOTIF [6]) that predicts new TFBSs in promoter sequences according to known motifs and assigns a p-value to each prediction.
[Figure: example motif matches (AATGATGC/TTACTACG and GCATCATT/CGTAGTAA) in the promoter of Gene 2.]

[Figure: average conservation score for sites predicted by a motif search tool versus known transcription factor binding sites.]
Known TFBSs are on average more conserved than other predicted sites.

[Figure: average number of neighboring known TFBSs for sites predicted by a motif search tool versus known transcription factor binding sites.]
Known TFBSs have on average more neighbors among known TFBSs than other predicted sites do.

[Figure: average number of sites with a similar sequence for promoters without a known TFBS versus promoters with a known TFBS.]
Known TFBSs tend to be surrounded by other sites that match their motif.

The feature vector (X1, X2, X3, X4, X5) combines: conservation, distance from TSS, number of neighboring binding sites, orientation of transcription factor binding, and number of neighboring sites that fit the motif.

Acknowledgments
Dr. Yael Altuvia, for her help with the feature definition.
Tommy Kaplan, for his help with the TestMOTIF tool.

The differentiation is made based on the following five features:
Evolutionary conservation
Number of neighboring known binding sites of other TFs
Number of neighboring sites with a similar sequence
Distance from the TSS of the target gene
Orientation of the transcription factor binding
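These five features make up the 5-coordinate vector that is compared under the two learned mixture models. The sketch below illustrates that comparison only. It is not the authors' implementation: it assumes a finite mixture with diagonal covariances, whereas the poster uses a countably infinite mixture, and every name and parameter value (`gmm_log_density`, `classify_site`, `KNOWN_MODEL`, `NEW_MODEL`) is hypothetical.

```python
import math


def gmm_log_density(x, weights, means, variances):
    """Log density of x under a Gaussian mixture with diagonal covariances."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        component_logs.append(log_p)
    # Log-sum-exp over components for numerical stability.
    top = max(component_logs)
    return top + math.log(sum(math.exp(v - top) for v in component_logs))


def classify_site(x, known_model, new_model):
    """Label x by the model under which it is more probable."""
    log_ratio = gmm_log_density(x, *known_model) - gmm_log_density(x, *new_model)
    label = "known-like" if log_ratio > 0 else "new-like"
    return label, log_ratio


# Hypothetical parameters (weights, means, variances); not taken from the poster.
KNOWN_MODEL = ([0.5, 0.5], [[1.0] * 5, [2.0] * 5], [[1.0] * 5, [1.0] * 5])
NEW_MODEL = ([1.0], [[-1.0] * 5], [[1.0] * 5])

label, log_ratio = classify_site([1.0] * 5, KNOWN_MODEL, NEW_MODEL)
print(label)  # known-like
```

Working in log space keeps the comparison stable when the component densities are very small, which is common for multi-dimensional feature vectors.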