Motifs BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.

Slides:



Advertisements
Similar presentations
Gene Regulation and Microarrays. Finding Regulatory Motifs Given a collection of genes with common expression, Find the TF-binding motif in common......
Advertisements

Predicting Enhancers in Co-Expressed Genes Harshit Maheshwari Prabhat Pandey.
Combined analysis of ChIP- chip data and sequence data Harbison et al. CS 466 Saurabh Sinha.
Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.
STRATEGY FOR GENE REGULATION 1.INFORMATION IN NUCLEIC ACID – CIS ELEMENT CIS = NEXT TO; ACTS ONLY ON THAT MOLECULE 2.TRANS FACTOR (USUALLY A PROTEIN) BINDS.
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
Control of Prokaryotic Gene Expression. Prokaryotic Regulation of Genes Regulating Biochemical Pathway for Tryptophan Synthesis. 1.Produce something that.
Gene Finding BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
Bioinformatics Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly Lecture 3 Finding Motifs Aleppo University Faculty of technical engineering.
Motif Finding. Regulation of Genes Gene Regulatory Element RNA polymerase (Protein) Transcription Factor (Protein) DNA.
Identification of a Novel cis-Regulatory Element Involved in the Heat Shock Response in Caenorhabditis elegans Using Microarray Gene Expression and Computational.
Transcription factor binding motifs (part I) 10/17/07.
A Very Basic Gibbs Sampler for Motif Detection Frances Tong July 28, 2004 Southern California Bioinformatics Summer Institute.
Microarrays and Cancer Segal et al. CS 466 Saurabh Sinha.
Introduction to BioInformatics GCB/CIS535
Discovery of RNA Structural Elements Using Evolutionary Computation Authors: G. Fogel, V. Porto, D. Weekes, D. Fogel, R. Griffey, J. McNeil, E. Lesnik,
Introduction to Bioinformatics - Tutorial no. 5 MEME – Discovering motifs in sequences MAST – Searching for motifs in databanks TRANSFAC – The Transcription.
Biological Sequence Pattern Analysis Liangjiang (LJ) Wang March 8, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 16.
Fuzzy K means.
Promoter Analysis using Bioinformatics, Putting the Predictions to the Test Amy Creekmore Ansci 490M November 19, 2002.
Ab initio motif finding
Finding Regulatory Motifs in DNA Sequences
REGULATORY GENOMICS Saurabh Sinha, Dept. of Computer Science & Institute of Genomic Biology, University of Illinois.
Motif finding : Lecture 2 CS 498 CXZ. Recap Problem 1: Given a motif, finding its instances Problem 2: Finding motif ab initio. –Paradigm: look for over-represented.
Computational Molecular Biology Biochem 218 – BioMedical Informatics Gene Regulatory.
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
Regulatory factors 1) Gene copy number 2) Transcriptional control 2-1) Promoters 2-2) Terminators, attenuators and anti-terminators 2-3) Induction and.
Functional genomics + Data mining BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin.
Guiding Motif Discovery by Iterative Pattern Refinement Zhiping Wang, Mehmet Dalkilic, Sun Kim School of Informatics, Indiana University.
Detecting binding sites for transcription factors by correlating sequence data with expression. Erik Aurell Adam Ameur Jakub Orzechowski Westholm in collaboration.
Expectation Maximization and Gibbs Sampling – Algorithms for Computational Biology Lecture 1- Introduction Lecture 2- Hashing and BLAST Lecture 3-
Finish up array applications Move on to proteomics Protein microarrays.
Sequence analysis – an overview A.Krishnamachari
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Sampling Approaches to Pattern Extraction
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Starting Monday M Oct 29 –Back to BLAST and Orthology (readings posted) will focus on the BLAST algorithm, different types and applications of BLAST; in.
Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute.
Alternative Splicing (a review by Liliana Florea, 2005) CS 498 SS Saurabh Sinha 11/30/06.
Bayesian Machine learning and its application Alan Qi Feb. 23, 2009.
Biological Motif Discovery Concepts Motif Modeling and Motif Information EM and Gibbs Sampling Comparative Motif Prediction Applications Transcription.
Local Multiple Sequence Alignment Sequence Motifs
Regulation of Gene Expression in Bacteria and Their Viruses
CS 6243 Machine Learning Advanced topic: pattern recognition (DNA motif finding)
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
Special Topics in Genomics Motif Analysis. Sequence motif – a pattern of nucleotide or amino acid sequences GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA.
Intro to Probabilistic Models PSSMs Computational Genomics, Lecture 6b Partially based on slides by Metsada Pasmanik-Chor.
Introduction to Bioinformatics - Tutorial no. 5 MEME – Discovering motifs in sequences MAST – Searching for motifs in databanks TRANSFAC – the Transcription.
Transcription factor binding motifs (part II) 10/22/07.
Motif identification with Gibbs Sampler Xuhua Xia
Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.
CS5263 Bioinformatics Lecture 11 Motif finding. HW2 2(C) Click to find out K and lambda.
HW7: Evolutionarily conserved segments ENCODE region 009 (beta-globin locus) Multiple alignment of human, dog, and mouse 2 states: neutral (fast-evolving),
Using public resources to understand associations Dr Luke Jostins Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015.
1 Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model Authors Mayetri Gupta & Jun S. Liu Presented by Ellen Bishop 12/09/2003.
Chapter 15. I. Prokaryotic Gene Control  A. Conserves Energy and Resources by  1. only activating proteins when necessary  a. don’t make tryptophan.
Chapter 15. I. Prokaryotic Gene Control  A. Conserves Energy and Resources by  1. only activating proteins when necessary  a. don’t make tryptophan.
BIOBASE Training TRANSFAC ® Containing data on eukaryotic transcription factors, their experimentally-proven binding sites, and regulated genes ExPlain™
Warm Up Write down 5 times it would be beneficial for a gene to be ‘turned off’ and the protein not be expressed 1.
A Very Basic Gibbs Sampler for Motif Detection
Motifs BCH364C/394P - Systems Biology / Bioinformatics
Learning Sequence Motif Models Using Expectation Maximization (EM)
Recitation 7 2/4/09 PSSMs+Gene finding
Albert Xue, Binbin Huang, Jianrong Wang
Regulation of Transcription Initiation
BIOBASE Training TRANSFAC® ExPlain™
Motifs BCH339N Systems Biology / Bioinformatics – Spring 2016
Presentation transcript:

Motifs BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring 2015

Nature Communications 4, Article number: 2078 doi: /ncomms3078 RamR represses the ramA gene, which encodes the activator protein for the acrAB drug efflux pump genes. An example transcriptional regulatory cascade Here, controlling Salmonella bacteria multidrug resistance RamR dimer Sequence- specific DNA binding

Antimicrob Agents Chemother. Feb 2012; 56(2): 942–948. Historically, DNA and RNA binding sites were defined biochemically (DNAse footprinting, gel shift assays, etc.) Hydroxyl radical footprinting of ramR-ramA intergenic region with RamR

Historically, DNA and RNA binding sites were defined biochemically (DNAse footprinting, gel shift assays, etc.) Now, many binding motifs are discovered bioinformatically Isolate different nucleic acid segments bound by copies of the protein (e.g. all sites bound across a genome) Sequence Search computationally for recurring motifs Image: Antimicrob Agents Chemother. Feb 2012; 56(2): 942–948.

Transcription factor regulatory networks can be highly complex, e.g. as for embryonic stem cell regulators TFtarget PPI TF

MOTIFS Binding sites of the transcription factor ROX1 consensus frequencies scaled by information content freq of nuc b in genome frequency of nuc b at position i

So, here’s the challenge: Given a set of DNA sequences that contain a motif (e.g., promoters of co-expressed genes), how do we discover it computationally?

Could we just count all instances of each k-mer? Why or why not?  promoters and DNA binding sites are not well conserved

How does motif discovery work?

Update the motif model Assign sites to motif Update the motif model Assign sites to motif etc. What does this process remind you of?

How does motif discovery work? Motif finding often uses expectation-maximization (like the k-means clustering we already learned about), i.e. alternating between building/updating a motif model and assigning sequences to that motif model. Searches the space of possible motifs for optimal solutions without testing everything. Most common approach = Gibbs sampling

We will consider N sequences, each with a motif of length w: N seqs k A k = position in seq k of motif q ij = probability of finding nucleotide (or aa) j at position i in motif i ranges from 1 to w j ranges across the nucleotides (or aa) p j = background probability of finding nucleotide (or aa) j w

Start by choosing w and randomly positioning each motif: N seqs k A k = position in seq k of motif q ij = probability of finding nucleotide (or aa) j at position i in motif i ranges from 1 to w j ranges across the nucleotides (or aa) p j = background probability of finding nucleotide (or aa) j Completely randomly positioned! NOTE: You won’t give any information at all about what or where the motif should be!

Predictive update step: Randomly choose one sequence, calculate q ij and p j from N-1 remaining sequences q ij = probability of finding nucleotide (or aa) j at position i in motif i ranges from 1 to w j ranges across the nucleotides (or aa) p j = background probability of finding nucleotide (or aa) j Randomly choose Update model w/ these background frequency of symbol j count of symbol j at position i bjbj p j is calculated similarly from the counts outside the motifs

Stochastic sampling step: For withheld sequence, slide motif down sequence & calculate agreement with model Withheld sequence Odds ratio of agreement with model vs. background Position in sequence  q ij   p j  c xij (see the paper for details)

Stochastic sampling step: For withheld sequence, slide motif down sequence & calculate agreement with model Withheld sequence Odds ratio of agreement with model vs. background Position in sequence  q ij   p j  c xij (see the paper for details) Here’s the cool part. DON’T just choose the maximum. INSTEAD, select a new A k position proportional to this odds ratio. Then, choose a new sequence to withhold, and repeat everything.

Over many iterations, this magically converges to the most enriched motifs. Note, it’s stochastic: 3 runs on the same data

Measure mRNA abundances using DNA microarrays Search for motifs in promoters of glucose vs galactose controlled genes Discovered motifs Known motif Galactose upstream activation sequence “AlignAce”

Measure mRNA abundances using DNA microarrays Search for motifs in promoters of heat- induced and repressed genes Discovered motifs Known motif Cell cycle activation motif, histone activator “AlignAce”

e.g., If you need them, we now know the binding motifs for 100’s of transcription factors at 1000’s of distinct sites in the human genome, including many new motifs.