Conference Report: Recomb Satellite NYC, Nov 2010 DREAM, Systems Biology and Regulatory Genomics.


DREAM reverse engineering challenges
The first part of the meeting was DREAM, a reverse engineering competition. It lasted one day, during which the best performers of the different challenges presented their solutions. I was invited to present as best performer of the bonus round of challenge 2. Compared with the other best performer, our method was as accurate and much faster.

Reminder of the challenge

Our predictions against the gold standard

TF number  Gold Standard  ACGT team answer
TF_1   Ar
TF_2   Dbp      Ces2
TF_3   Foxo6    Foxf2
TF_4   Klf12    Klf16
TF_5   Klf8     Klf7
TF_6   Klf9     Mybl2
TF_7   Mlx      Myc
TF_8   Mzf1     Nfkb1
TF_9   Mzf1     Nfkb2
TF_10  Nfil3
TF_11  Nr2f6    Nr2f1
TF_12  Nr4a2    Pparg
TF_13  Pou2f1
TF_14  Mypop    Pou3f2
TF_15  Pou1f1   Pou3f4
TF_16  Prdm11   Pou5f1
TF_17  Rorb     Rora
TF_18  Sox10    Sox2
TF_19  Sox3     Sox5
TF_20  Sox6     Sox9
TF_21  Srebf1
TF_22  Tbx2     Tbx1
TF_23  Tbx20    Tbx21
TF_24  Tbx4     Tbx5
TF_25  Tbx5     Tbx6
TF_26  Tcfec    Usf26
TF_27  Xbp1
TF_28  Zfp202   Zfp281
TF_29  Zfp263   Zfp691
TF_30  Zfp3     Zfpm1
TF_31  Zfx
TF_32  Zkscan1  Zscan4e
TF_33  Zscan10  Zscan4f

TF number  Gold Standard  ACGT team answer
TF_34  Ahctf1   Arid3a
TF_35  Atf3     Atf1
TF_36  Atf4     Atf2
TF_37  Dnajc21  Cfl2
TF_38  Dmrtc2   Dmrt2
TF_39  Egr3     Egr1
TF_40  Esrrb    Esr2
TF_41  Esrrg    Esrra
TF_42  Foxc2    Foxj1
TF_43  Foxg1    Foxl1
TF_44  Gata4    Gata5
TF_45  Mybl2    Mybl1
TF_46  Nhlh2    Nhlh1
TF_47  Nkx2-9   Nkx3-2
TF_48  Nr2e1    Nr1h2
TF_49  Nr2f1    Ppard
TF_50  Nr5a2    Pparg
TF_51  Pou1f1   Prrx2
TF_52  Rarg     Rara
TF_53  Rfx7     Rfx4
TF_54  Rora     Rxra
TF_55  Sdccag8  Tbp
TF_56  Snai1    Tgif1
TF_57  Sp140    Tgif2
TF_58  Tbx1     Tgif2lx1
TF_59  Zbtb1    Tgif2lx2
TF_60  Zfp300   Xbp1
TF_61  Zfp637   Zfp128
TF_62  Zic5     Zic1
TF_63  Zkscan5  Zic2
TF_64  Zfp740   Zic3
TF_65  Zscan10  Zic4
TF_66  Zscan10  Mzf1

Systems Biology and Regulatory Genomics
DREAM was followed by:
1. Systems Biology. Pathway inference and reverse engineering of cellular networks. Cellular signatures of biological responses and disease states. Phosphorylation, metabolic fluxes, systematic phenotyping. Mathematical modeling and simulation of biological systems.
2. Regulatory Genomics. Modeling and recognition of regulatory motifs and modules. Chromatin state establishment, maintenance, and role in development. Post-transcriptional regulation and small regulatory RNAs. Regulatory networks, metabolic networks, proteomic networks.

Computational identification of specific cis-regulatory elements using sequence and expression data
Rahul Karnik, Michael Beer
Department of Biomedical Engineering, Johns Hopkins University School of Medicine

Introduction
Current approaches to motif finding typically consist of two steps:
1. Identification of sets of co-regulated genes based on their expression patterns, usually by clustering.
2. Searching for overrepresented sequence motifs in the upstream sequences of each set of related genes by Gibbs sampling or expectation maximization.

[Figure: Motif discovery: the two-step pipeline. A co-regulated gene set is obtained from gene expression microarrays by clustering (Cluster I, Cluster II, Cluster III), from location analysis (ChIP-chip, …), or from a functional group (e.g., GO term); motif discovery is then run on the promoter/3'UTR sequences of that set.]

The new algorithm, Inspector, integrates upstream sequence and expression data to find co-expressed genes with a sequence motif that is specific to that group of genes. Inspector addresses two limitations of the current approach:
1. An integrated model reduces the effect of noise in expression data.
2. Optimizing for specificity prevents the identification of ubiquitous sequence motifs as determinants of expression.

Algorithm
Inspector is an iterative Gibbs sampling algorithm whose objective function is the specificity of the sequence motif to the genes in the current search set, i.e. the genes with similar expression profiles. The notation is as follows: there are N sequences in total, s1 of which have the motif, s2 of which are similarly expressed, and x of which are in the intersection of these two sets.

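The specificity score is the hypergeometric tail, i.e. the probability that at least x of the s1 motif-containing genes fall in the intersection with the s2 similarly expressed genes by chance.
[Figure: Venn diagram of the sets S1 and S2 within the N sequences, with an intersection of size ≥ x.]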
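The hypergeometric tail described above can be written as $P(X \ge x) = \sum_{i=x}^{\min(s_1, s_2)} \binom{s_2}{i}\binom{N - s_2}{s_1 - i} / \binom{N}{s_1}$. The following is a minimal sketch of that calculation, not the authors' implementation; the function name `hypergeom_tail` and its argument order are assumptions made for illustration.

```python
from math import comb


def hypergeom_tail(N, s1, s2, x):
    """Probability that at least x of the s1 motif-containing genes fall
    among the s2 similarly expressed genes, out of N genes in total."""
    if not (0 <= s1 <= N and 0 <= s2 <= N and x >= 0):
        raise ValueError("need 0 <= s1, s2 <= N and x >= 0")
    upper = min(s1, s2)
    total = comb(N, s1)
    # Sum the hypergeometric probabilities for overlaps of size x..upper.
    favourable = sum(comb(s2, i) * comb(N - s2, s1 - i)
                     for i in range(x, upper + 1))
    return favourable / total


if __name__ == "__main__":
    # Hypothetical numbers: 5000 genes, 60 with the motif,
    # 80 similarly expressed, 20 in both sets.
    print(hypergeom_tail(5000, 60, 80, 20))
```

A small value means the overlap is unlikely to arise by chance, i.e. the motif is specific to the expression group.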

The integrated model has two components:
1. The sequence model, a position weight matrix (PWM) derived from candidate motif instances.
2. The expression model, the mean expression profile of the genes currently in the model.
Sequence and expression thresholds are adjusted at regular intervals to minimize the specificity score (as a tail probability, a smaller score means a more specific motif).

The integrated model, which is composed of a sequence component and an expression component, is iteratively refined to maximize the objective function, specificity of the motif.

Initialization: a random gene and a random position in its sequence are picked; the k-mer at that position seeds the initial PWM, and the expression profile of the gene seeds the expression model. Several initializations are tried. In each iteration the new model is obtained by averaging: the PWM over the current motif instances and the expression profile over the genes currently in the model. The process halts when the specificity no longer improves. Expression profile similarity is measured with the Pearson correlation coefficient.
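For reference, the Pearson correlation coefficient between two expression profiles u and v is $r = \sum_i (u_i - \bar{u})(v_i - \bar{v}) / \sqrt{\sum_i (u_i - \bar{u})^2 \sum_i (v_i - \bar{v})^2}$. A minimal sketch in plain Python follows; the function name `pearson` is an assumption for illustration, and the talk did not specify how Inspector computes it.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(u) != len(v) or len(u) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_u * var_v)


if __name__ == "__main__":
    gene_profile = [0.1, 1.2, -0.4, 2.0]    # hypothetical expression values
    model_profile = [0.0, 1.0, -0.5, 1.8]   # hypothetical mean profile
    print(pearson(gene_profile, model_profile))
```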

The PWM match scores
The PWM match score is taken from AlignACE (Hughes J. et al. 98), except that the background model is a 3rd- to 5th-order Markov model (instead of 0th-order). The score S for a site Q, whose sequence as a function of position p is given by q(p) (e.g. AAAACCGTTCAGTCAGGTCATAGC), and a matrix M (next slide) is approximately
$S \approx \log \prod_{p} \left(\text{frequency of } q(p) \text{ at position } p \text{ in the PWM}\right)$

$F_{p,b}$ is the number of bases of type b aligned at position p, N is the number of aligned sites, and $p_b$ is the genomic background nucleotide frequency for base b. The first term corresponds to the log of the frequency of a given base at a particular position in the motif alignment, estimated with a Bayesian prior distribution corresponding to the genomic mononucleotide frequencies and a total pseudocount. Without the prior, the frequency of base b at position p in the PWM (rows A, C, G, T) is $F_{p,b} / N$.
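The scoring idea can be sketched as follows. This is an illustration under stated assumptions, not the AlignACE or Inspector code: the pseudocount value of 1.0 is assumed (the slide text does not give it), the background is a simple 0th-order model rather than the 3rd- to 5th-order Markov model used in the talk, and subtracting the log background frequency as the second term is also an assumption. The names `count_matrix` and `match_score` are hypothetical.

```python
from math import log

BASES = "ACGT"


def count_matrix(sites):
    """F[p][b]: number of aligned sites that have base b at position p."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all aligned sites must have the same length")
    return [{b: sum(1 for s in sites if s[p] == b) for b in BASES}
            for p in range(width)]


def match_score(site, F, n_sites, background, pseudocount=1.0):
    """Sum over positions of log(smoothed PWM frequency) minus
    log(background frequency) for the base observed in `site`."""
    if len(site) != len(F):
        raise ValueError("site length must equal the matrix width")
    score = 0.0
    for p, base in enumerate(site):
        # Frequency estimate with a prior proportional to the background.
        freq = (F[p][base] + pseudocount * background[base]) / (n_sites + pseudocount)
        score += log(freq) - log(background[base])
    return score


if __name__ == "__main__":
    aligned = ["ACGT", "ACGT", "ACGA"]                   # toy alignment
    bg = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}       # assumed background
    F = count_matrix(aligned)
    print(match_score("ACGT", F, len(aligned), bg))
```

Sites that agree with the alignment at most positions score higher than sites that do not.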

Synthetic datasets used
The synthetic sequence dataset consisted of 5000 sequences divided into 80 sets of varying size. All the sequences in a set were seeded with one common functional motif and four ubiquitous motifs (picked randomly from 20 false motifs), with non-motif sequence having the same nucleotide frequencies as yeast intergenic sequence.
[Figure: example synthetic sequence containing motifs from the pool of false motifs and one real motif.]

Every set of genes was assigned a mean expression profile across 50 conditions, corresponding to regulatory control by one functional motif. Each gene in a set was then assigned an expression profile around this mean profile with additive Gaussian noise.
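A minimal sketch of how such synthetic expression data could be generated is shown below. The way the mean profile is drawn (standard normal values), the noise standard deviation, and the function name `synthetic_expression` are assumptions for illustration; the talk did not state these details.

```python
import random


def synthetic_expression(n_genes, n_conditions=50, noise_sd=0.5, seed=0):
    """Return (mean_profile, profiles): one mean profile for a gene set and
    one noisy copy of it per gene (additive Gaussian noise)."""
    rng = random.Random(seed)
    # Assumed: the mean profile itself is drawn from a standard normal.
    mean_profile = [rng.gauss(0.0, 1.0) for _ in range(n_conditions)]
    profiles = [[m + rng.gauss(0.0, noise_sd) for m in mean_profile]
                for _ in range(n_genes)]
    return mean_profile, profiles


if __name__ == "__main__":
    mean_profile, profiles = synthetic_expression(n_genes=30)
    print(len(profiles), "genes x", len(profiles[0]), "conditions")
```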

Results
Inspector performs better at detecting motifs in synthetic sequence and expression datasets than the combination of k-means clustering and AlignACE. The sequence dataset was created to mimic the base-pair composition and length of yeast intergenic sequence, while the expression data matches pairwise correlation characteristics of real yeast expression datasets.
[Figure axes: 1 − specificity (false positive rate) against sensitivity (true positive rate).]
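For readers unfamiliar with the two axes, the sketch below computes sensitivity and specificity for a predicted gene set against a known one. The function name and the set-based representation are assumptions for illustration; the talk did not describe how the evaluation was coded.

```python
def sensitivity_specificity(predicted, actual, universe):
    """Sensitivity (true positive rate) and specificity (true negative rate)
    of a predicted set of genes against the actual set, within `universe`."""
    predicted, actual, universe = set(predicted), set(actual), set(universe)
    tp = len(predicted & actual)
    fn = len(actual - predicted)
    fp = len(predicted - actual)
    tn = len(universe - predicted - actual)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one actual positive and one actual negative")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    genes = range(10)                       # toy gene universe
    sens, spec = sensitivity_specificity({0, 1, 4}, {0, 1, 2, 3}, genes)
    print(sens, 1 - spec)                   # the two plotted quantities
```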

Real datasets used
Saccharomyces cerevisiae datasets: The sequence dataset was the upstream sequence for all yeast ORFs. The expression dataset was a combination of three different original datasets (Brauer08, Gasch00, Spellman98) and profiled all yeast ORFs over 292 conditions.
Caenorhabditis elegans datasets: The sequence dataset consisted of up to 2kb of upstream sequence for 5691 genes. The expression dataset was the same as that used by Beer and Tavazoie (2004). It contains 255 conditions.

Inspector detects more known motifs than the combination of k-means clustering and AlignACE. There were 97 known motifs in total (Harbison 2004). A CompareACE score of 0.75 or greater was considered a match. ChIP target sets (Harbison04) were considered a match if the hypergeometric p-value for overlap was less than $10^{-7}$.

The first is a known motif. The two others are new motifs in C. elegans, which are candidates for experimental validation.

Studies in the field of motif finding

Inference of binding specificity from protein binding domains (work from the group of Tim Hughes at University of Toronto, presented by Matt Weirauch)
This is an ambitious study to infer the binding specificity of TFs in eukaryotes using protein domain similarity. It is well known that similar TFs (i.e. from the same TF family) have similar binding sites and binding specificities. The goal is to infer the binding specificity of a TF from its binding domain and its similarity to other TFs whose binding motifs are known.

Their aims are three-fold:
1. Use PBM data to refine and test rules for inference of TF sequence specificity.
2. Generate the data needed to produce accurate “Pfam-wide” inferences of sequence specificity for as many eukaryotic DNA-binding domain classes as possible.
3. Construct a DB to house both known and inferred sequence preferences for eukaryotes with available genomic sequences.

Discriminative motif finding (work from the group of Ziv Bar-Joseph at CMU, presented by Shan Zhong)
The algorithm looks for a motif that best discriminates between a positive set of sequences and a negative set (whose sequences are assumed not to contain the motif). They use a generative mixture model for k-mer distributions that can be viewed as a 0th-order HMM. The user specifies a motif length k.

The method extracts all k-mers from the positive and negative sequences, and then searches for a position weight matrix that maximizes a discriminative target function. This function represents the difference in the expected number of times that the mixture component was used in the HMM to generate the positive and negative sequences. The running time is independent of the input sequence size (depends only on k).
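The k-mer extraction step can be sketched as below. This only illustrates counting k-mers in the two sets; the raw count difference is a simple stand-in and is not the discriminative target function of the method presented, and the function names are assumptions.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer in a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def count_difference(positive, negative, k):
    """Per-k-mer difference between positive-set and negative-set counts
    (a stand-in for a real discriminative score)."""
    pos = kmer_counts(positive, k)
    neg = kmer_counts(negative, k)
    return {kmer: pos[kmer] - neg[kmer] for kmer in set(pos) | set(neg)}


if __name__ == "__main__":
    positives = ["ACGACGTT", "TTACGACG"]    # toy sequences
    negatives = ["TTTTTTTT", "GGGGTTTT"]
    diff = count_difference(positives, negatives, k=3)
    print(sorted(diff.items(), key=lambda item: -item[1])[:3])
```

Once the k-mers are tabulated, later steps can work on the count table rather than on the raw sequences, which is consistent with the claim that the search cost depends on k.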

Motif finding in mRNA UTR regions (work from the group of Tim Hughes at University of Toronto, presented by Quaid Morris)
The main contribution here is that a motif is represented not only by its sequence but also by its structural parameters. RNA has a specific structure, which affects the binding of proteins to it, and the novelty is that these structural features are incorporated in the model that represents the motif. MalaRKey is a new motif finding method that uses a feature-based product model to represent RBP binding affinity for a given site.

The structural features are:
1. The site is in a hairpin loop.
2. The 1st base in the site is paired and the rest are in a hairpin loop.
3. Tendency of particular subsequences to share the same secondary structure context.
A nice result they showed is that when the motif sequence is of length 4 there is a preference for binding to specific RNA structures; as the motif length increases the preference decreases, and by length 7 there is almost no structural preference. This means that some of the information is encoded in the structure together with the sequence.