Conference Report: Recomb Satellite NYC, Nov 2010 DREAM, Systems Biology and Regulatory Genomics.

DREAM reverse engineering challenges
The first part of the meeting was DREAM, a reverse engineering competition. It was a one-day session in which the best performers in each challenge presented their solutions. I was invited to present as the best performer in the bonus round of challenge 2. Compared with the other best performer, our method was as accurate and much faster.

Reminder of the challenge

Our predictions against the gold standard
TF number   Gold Standard   ACGT team answer
TF_1        Ar
TF_2        Dbp             Ces2
TF_3        Foxo6           Foxf2
TF_4        Klf12           Klf16
TF_5        Klf8            Klf7
TF_6        Klf9            Mybl2
TF_7        Mlx             Myc
TF_8        Mzf1            Nfkb1
TF_9        Mzf1            Nfkb2
TF_10       Nfil3
TF_11       Nr2f6           Nr2f1
TF_12       Nr4a2           Pparg
TF_13       Pou2f1
TF_14       Mypop           Pou3f2
TF_15       Pou1f1          Pou3f4
TF_16       Prdm11          Pou5f1
TF_17       Rorb            Rora
TF_18       Sox10           Sox2
TF_19       Sox3            Sox5
TF_20       Sox6            Sox9
TF_21       Srebf1
TF_22       Tbx2            Tbx1
TF_23       Tbx20           Tbx21
TF_24       Tbx4            Tbx5
TF_25       Tbx5            Tbx6
TF_26       Tcfec           Usf26
TF_27       Xbp1
TF_28       Zfp202          Zfp281
TF_29       Zfp263          Zfp691
TF_30       Zfp3            Zfpm1
TF_31       Zfx
TF_32       Zkscan1         Zscan4e
TF_33       Zscan10         Zscan4f

TF number   Gold Standard   ACGT team answer
TF_34       Ahctf1          Arid3a
TF_35       Atf3            Atf1
TF_36       Atf4            Atf2
TF_37       Dnajc21         Cfl2
TF_38       Dmrtc2          Dmrt2
TF_39       Egr3            Egr1
TF_40       Esrrb           Esr2
TF_41       Esrrg           Esrra
TF_42       Foxc2           Foxj1
TF_43       Foxg1           Foxl1
TF_44       Gata4           Gata5
TF_45       Mybl2           Mybl1
TF_46       Nhlh2           Nhlh1
TF_47       Nkx2-9          Nkx3-2
TF_48       Nr2e1           Nr1h2
TF_49       Nr2f1           Ppard
TF_50       Nr5a2           Pparg
TF_51       Pou1f1          Prrx2
TF_52       Rarg            Rara
TF_53       Rfx7            Rfx4
TF_54       Rora            Rxra
TF_55       Sdccag8         Tbp
TF_56       Snai1           Tgif1
TF_57       Sp140           Tgif2
TF_58       Tbx1            Tgif2lx1
TF_59       Zbtb1           Tgif2lx2
TF_60       Zfp300          Xbp1
TF_61       Zfp637          Zfp128
TF_62       Zic5            Zic1
TF_63       Zkscan5         Zic2
TF_64       Zfp740          Zic3
TF_65       Zscan10         Zic4
TF_66       Zscan10         Mzf1

Systems Biology and Regulatory Genomics
DREAM was followed by:
1. Systems Biology
Pathway inference and reverse engineering of cellular networks. Cellular signatures of biological responses and disease states. Phosphorylation, metabolic fluxes, systematic phenotyping. Mathematical modeling and simulation of biological systems.
2. Regulatory Genomics
Modeling and recognition of regulatory motifs and modules. Chromatin state establishment, maintenance, and role in development. Post-transcriptional regulation and small regulatory RNAs. Regulatory networks, metabolic networks, proteomic networks.

Computational identification of specific cis-regulatory elements using sequence and expression data
Rahul Karnik, Michael Beer
Department of Biomedical Engineering, Johns Hopkins University School of Medicine

Introduction
Current approaches to motif finding typically consist of two steps:
1. Identification of sets of co-regulated genes based on their expression patterns, usually by clustering.
2. Searching for overrepresented sequence motifs in the upstream sequences of each set of related genes by Gibbs sampling or expectation maximization.

Motif discovery: The two-step pipeline
[Figure: co-regulated gene sets (Cluster I, Cluster II, Cluster III) are obtained by clustering gene expression microarrays, by location analysis (ChIP-chip, …), or from a functional group (e.g., GO term); motif discovery is then run on the promoter/3'UTR sequences of each co-regulated gene set.]

The new algorithm, Inspector, integrates upstream sequence and expression data to find co-expressed genes with a sequence motif that is specific to that group of genes. Inspector addresses two limitations of the current approach:
1. An integrated model reduces the effect of noise in expression data.
2. Optimizing for specificity prevents the identification of ubiquitous sequence motifs as determinants of expression.

Algorithm
Inspector is an iterative Gibbs sampling algorithm whose objective function is the specificity of the sequence motif to the genes in the current search set, i.e. the genes with similar expression profiles. Consider N total sequences, of which s1 have the motif, s2 are similarly expressed, and x are in the intersection of these two sets.

The specificity score is the hypergeometric tail, i.e. the probability that at least x of the s1 motif-containing genes fall in the intersection by chance.
[Figure: Venn diagram of the N sequences showing the motif set S1, the expression set S2, and their intersection of size ≥ x.]
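To make the score concrete, here is a minimal sketch of the hypergeometric tail, P(X ≥ x) = Σ_{i=x}^{min(s1,s2)} C(s2, i) · C(N − s2, s1 − i) / C(N, s1). The function name `hypergeom_tail` is hypothetical and is not taken from the Inspector implementation; it only illustrates the quantity described above.

```python
from math import comb

def hypergeom_tail(N, s1, s2, x):
    """Probability that at least x of the s1 motif-containing sequences
    are among the s2 similarly expressed ones, out of N sequences total.
    (Illustrative helper, not the authors' code.)"""
    if not (0 <= s1 <= N and 0 <= s2 <= N):
        raise ValueError("s1 and s2 must lie between 0 and N")
    lower = max(x, 0)
    upper = min(s1, s2)
    total = comb(N, s1)  # number of ways to choose the motif set
    hits = sum(comb(s2, i) * comb(N - s2, s1 - i)
               for i in range(lower, upper + 1))
    return hits / total
```

A smaller value means the overlap is less likely to arise by chance, i.e. the motif is more specific to the expression set.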

The integrated model has two components:
1. The sequence model, a position weight matrix derived from candidate motif instances.
2. The expression model, the mean expression profile of the genes currently in the model.
Sequence and expression thresholds are adjusted at regular intervals to minimize the specificity score.
[Figure: an example expression profile and an example position weight matrix over positions 1 to 6, with rows A, C, G, T.]

The integrated model, composed of a sequence component and an expression component, is iteratively refined to optimize the objective function, the specificity of the motif (that is, to minimize the hypergeometric tail probability).

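Initialization: a random gene and a random position within its sequence are picked; the k-mer at that position becomes the initial PWM, and the gene's expression profile becomes the initial expression model. Several initializations are tried. The process halts when the specificity no longer improves. In each iteration, the new model is obtained by averaging: the PWM over the current motif instances and the expression profile over the current genes. Expression profile similarity is measured with the Pearson correlation coefficient:
$$r(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$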
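As an illustration of the expression side of the model, the sketch below computes the Pearson correlation between two profiles and the mean profile of a set of genes. The names `pearson` and `mean_profile` are hypothetical, and the code is a plain restatement of the standard definitions rather than anything shown in the talk.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(vx * vy)

def mean_profile(profiles):
    """Condition-wise mean over the profiles of the genes in the model."""
    return [sum(col) / len(col) for col in zip(*profiles)]
```

A gene would then be compared against the model by calling `pearson(gene_profile, mean_profile(current_profiles))` and checking the result against the expression threshold.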

The PWM match scores
The PWM match score is taken from AlignACE (Hughes J. et al. 98), except that the background model is a 3rd- to 5th-order Markov model (instead of 0th-order). The score S is defined for a site Q, whose sequence as a function of position is given by q(p) (example sequence: AAAACCGTTCAGTCAGGTCATAGC), and a matrix M (next slide):
$$S \approx \log \prod_p \left(\text{frequency of } q(p) \text{ at position } p \text{ of the PWM}\right)$$

F_{p,b} is the number of bases of type b aligned at position p, N is the number of aligned sites, and p_b is the genomic background nucleotide frequency for base b. The first term corresponds to the log of the frequency of a given base at a particular position in the motif alignment, estimated with a Bayesian prior distribution corresponding to the genomic mononucleotide frequencies and a total pseudocount of 1.
[Figure: example PWM over positions 1 to 6 with rows A, C, G, T; each entry is F_{p,b} / N.]
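The sketch below shows one way such a score could be computed. The first term follows the description above: a pseudocount of 1 split according to the background frequencies gives (F_{p,b} + p_b) / (N + 1). The second term is an assumption: the code subtracts the log of a 0th-order background frequency, whereas the talk used a 3rd- to 5th-order Markov background whose exact form was not shown. The names `count_matrix` and `site_score` are hypothetical.

```python
from math import log

BASES = "ACGT"

def count_matrix(sites):
    """F[p][b]: number of aligned sites with base b at position p."""
    width = len(sites[0])
    counts = [{b: 0 for b in BASES} for _ in range(width)]
    for site in sites:
        if len(site) != width:
            raise ValueError("all aligned sites must have the same length")
        for p, b in enumerate(site):
            counts[p][b] += 1
    return counts

def site_score(site, counts, n_sites, background):
    """Log-odds score of a candidate site against the motif alignment.
    Assumption: 0th-order background, a simplification of the talk's
    higher-order Markov background."""
    score = 0.0
    for p, b in enumerate(site):
        # Posterior frequency with a total pseudocount of 1.
        freq = (counts[p][b] + background[b]) / (n_sites + 1)
        score += log(freq) - log(background[b])
    return score
```

With aligned sites such as `["ACGT", "ACGT", "ACGA"]` and a uniform background, a site matching the consensus scores higher than an unrelated site such as `"TTTT"`.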

Synthetic datasets used
The synthetic sequence dataset consisted of 5000 sequences divided into 80 sets of varying size. All the sequences in a set were seeded with one common functional motif and four ubiquitous motifs (picked randomly from a pool of 20 false motifs); the non-motif sequence had the same nucleotide frequencies as yeast intergenic sequence.
[Figure: an example sequence seeded with one real motif and with motifs drawn from the pool of false motifs.]
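A rough sketch of how one such set of sequences could be generated is given below. Everything specific in it is an assumption made for illustration: the function names, the choice to draw the four false motifs independently for each sequence, the slot-based placement that keeps motifs from overlapping, and the background frequencies passed in by the caller (the real yeast intergenic frequencies were not given in the talk).

```python
import random

def random_sequence(length, freqs, rng):
    """Background sequence drawn with the given per-base frequencies."""
    bases = list(freqs)
    weights = [freqs[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))

def seed_motifs(seq, motifs, rng):
    """Overwrite one window per motif; each motif gets its own slot,
    so planted motifs never overlap."""
    slot = len(seq) // len(motifs)
    if any(len(m) > slot for m in motifs):
        raise ValueError("sequence too short for non-overlapping motifs")
    order = list(motifs)
    rng.shuffle(order)
    chars = list(seq)
    for i, motif in enumerate(order):
        start = i * slot + rng.randrange(slot - len(motif) + 1)
        chars[start:start + len(motif)] = motif
    return "".join(chars)

def make_set(n_seqs, length, functional, false_pool, freqs, rng, n_false=4):
    """One set: every sequence carries the functional motif plus
    n_false motifs sampled from the pool of false motifs."""
    seqs = []
    for _ in range(n_seqs):
        decoys = rng.sample(false_pool, n_false)
        background = random_sequence(length, freqs, rng)
        seqs.append(seed_motifs(background, [functional] + decoys, rng))
    return seqs
```

Passing a seeded `random.Random` instance as `rng` keeps the generated dataset reproducible.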

Every set of genes was assigned a mean expression profile across 50 conditions, corresponding to regulatory control by one functional motif. Each gene in a set was then assigned an expression profile around this mean profile with Gaussian noise.
[Figure: mean expression profile + additive Gaussian noise = gene expression profile.]
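The noise step can be sketched in a few lines. The function name `noisy_profiles` and the noise level `sigma` are hypothetical; the talk did not state the standard deviation that was used.

```python
import random

def noisy_profiles(mean_profile, n_genes, sigma, rng):
    """One profile per gene: the set's mean profile plus independent
    additive Gaussian noise in every condition."""
    return [[rng.gauss(m, sigma) for m in mean_profile]
            for _ in range(n_genes)]
```

For example, `noisy_profiles(mean, 60, 0.5, random.Random(1))` would give 60 gene profiles scattered around a 50-condition `mean` profile, where 60 and 0.5 are placeholder values.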

Results
Inspector performs better at detecting motifs in synthetic sequence and expression datasets than the combination of k-means clustering and AlignACE. The sequence dataset was created to mimic the base-pair composition and length of yeast intergenic sequence, while the expression data matches the pairwise correlation characteristics of real yeast expression datasets.
[Figure axes: 1 - specificity = false positive rate; sensitivity = true positive rate.]

Real datasets used
Saccharomyces cerevisiae datasets: The sequence dataset was the upstream sequence for all yeast ORFs. The expression dataset was a combination of three different original datasets (Brauer08, Gasch00, Spellman98) and profiled all yeast ORFs over 292 conditions.
Caenorhabditis elegans datasets: The sequence dataset consisted of up to 2 kb of upstream sequence for 5691 genes. The expression dataset was the same as that used by Beer and Tavazoie (2004). It contains 255 conditions.

Inspector detects more known motifs than the combination of k-means clustering and AlignACE. There were 97 known motifs in total (Harbison 2004). A CompareACE score of 0.75 or greater was considered a match. ChIP target sets (Harbison04) were considered a match if the hypergeometric p-value for overlap was less than 10^-7.

The first is a known motif. The two others are new motifs in C. elegans, which are candidates for experimental validation.

Studies in the field of motif finding
Inference of binding specificity from protein binding domains (work from the group of Tim Hughes at the University of Toronto, presented by Matt Weirauch)
This is an ambitious study to infer the binding specificity of TFs in eukaryotes using protein domain similarity. It is well known that similar TFs (i.e. from the same TF family) have similar binding sites and binding specificities. The goal is to infer the binding specificity of a TF from its binding domain and its similarity to other TFs whose binding motifs are known.

Their aims are three-fold:
1. Use PBM data to refine and test rules for inference of TF sequence specificity.
2. Generate the data needed to produce accurate "Pfam-wide" inferences of sequence specificity for as many eukaryotic DNA-binding domain classes as possible.
3. Construct a database to house both known and inferred sequence preferences for eukaryotes with available genomic sequences.

Discriminative motif finding (work from the group of Ziv Bar-Joseph at CMU, presented by Shan Zhong)
The algorithm looks for the motif that best discriminates between a positive set of sequences and a negative set (whose sequences are assumed not to contain the motif). They use a generative mixture model for k-mer distributions that can be viewed as a 0th-order HMM. The user specifies the motif length k.

The method extracts all k-mers from the positive and negative sequences, and then searches for a position weight matrix that maximizes a discriminative target function. This function represents the difference in the expected number of times that the mixture component was used in the HMM to generate the positive and negative sequences. The running time is independent of the input sequence size (depends only on k).
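The idea can be sketched as follows. This is a simplified stand-in, not the presented method: instead of expected mixture-component usage in the HMM, the score below weights each k-mer's PWM probability by the difference between its relative frequency in the positive and negative sets. All function names are hypothetical.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every overlapping k-mer in a collection of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def pwm_prob(kmer, pwm):
    """Probability of a k-mer under a PWM (list of base->probability dicts)."""
    prob = 1.0
    for pos, base in enumerate(kmer):
        prob *= pwm[pos].get(base, 0.0)
    return prob

def discriminative_score(pwm, pos_counts, neg_counts):
    """Simplified target: PWM mass on k-mers enriched in the positive set
    minus PWM mass on k-mers enriched in the negative set."""
    pos_total = sum(pos_counts.values())
    neg_total = sum(neg_counts.values())
    if pos_total == 0 or neg_total == 0:
        raise ValueError("both sets must contain at least one k-mer")
    score = 0.0
    for kmer in set(pos_counts) | set(neg_counts):
        diff = pos_counts[kmer] / pos_total - neg_counts[kmer] / neg_total
        score += diff * pwm_prob(kmer, pwm)
    return score
```

Once the k-mer counts are tabulated, each evaluation of the score loops over distinct k-mers (at most 4^k of them) rather than over the input sequences, which is in line with the remark that the running time depends only on k.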

Motif finding in mRNA UTR regions (work from the group of Tim Hughes at the University of Toronto, presented by Quaid Morris)
The main contribution here is that a motif is represented not only by its sequence but also by its structural parameters. RNA has a specific structure, which affects how proteins bind to it, and the novelty is that these structural features are incorporated into the model that represents the motif. MalaRKey is a new motif finding method that uses a feature-based product model to represent RBP binding affinity for a given site.

The structural features are:
1. The site is in a hairpin loop.
2. The 1st base of the site is paired and the rest is in a hairpin loop.
3. The tendency of particular subsequences to share the same secondary structure context.
A nice result they showed is that for motifs of length 4 there is a preference for binding to specific RNA structures; as the motif length increases the preference decreases, and by length 7 there is almost no structural preference. This means that some of the binding information is encoded in the structure together with the sequence.
