Transcription Regulation: Transcription Factor Motif Finding
Xiaole Shirley Liu
STAT115, STAT215


Imagine a Chef
Restaurant dinner vs. home lunch
Certain recipes are used to make certain dishes

Each Cell Is Like a Chef

Infant skin: glucose, oxygen, amino acid; healthy skin cell state
Adult liver: fat, alcohol, nicotine; disease liver cell state
Certain genes are expressed to make certain proteins

Understanding a Genome
Get the complete sequence (encoded cook book)
Observe gene expression at different cell states (meals prepared in different situations)
Decode gene regulation (decode the book, understand the rules)

Information in DNA
Coding region (2%): what is to be made
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATTTACCACATCGCATCACAGTTCAGGACTAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT
Analogy: Milk->Yogurt, Beef->Burger, Egg->Omelet, Fish->Sushi, Flour->Cake

Information in DNA
Non-coding region (98%): regulation (when, where, amount, other conditions, etc.)
Coding region: 2%
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATTTACCACATCGCATCACTACGACGGACTAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT
Analogy: Milk->Yogurt, Beef->Burger, Egg->Omelet, Fish->Sushi, Flour->Cake
Conditions in the analogy: Morning, Japanese Restaurant, 5 Oz, 9 Oz, Butter

Measure Gene Expression
Microarray or SAGE detects the expression of every gene at a certain cell state
Clustering finds genes that are co-expressed (and potentially share regulation)

STAT115, 04/01/2008
Decode Gene Regulation
Look at genes that are always expressed together
Upstream regions of co-expressed genes:
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT
Analogy: Scrambled Egg, Bacon, Cereal, Hash Brown, Orange Juice (shared condition: Morning)

Biology of Transcription Regulation
Upstream regions (lowercase) and coding starts (uppercase) of Hemoglobin Beta, Hemoglobin Zeta, Hemoglobin Alpha, and Hemoglobin Gamma:
...acatttgcttctgacacaactgtgttcactagcaacctca...aacagacaccATGGTGCACCTGACTCCTGAGGAGAAGTCT...
...agcaggcccaactccagtgcagctgcaacctgcccactcc...ggcagcgcacATGTCTCTGACCAAGACTGAGAGTGCCGTC...
...cgctcgcgggccggcactcttctggtccccacagactcag...gatacccaccgATGGTGCTGTCTCCTGCCGACAAGACCAA...
...gccccgccagcgccgctaccgccctgcccccgggcgagcg...gatgcgcgagtATGGTGCTGTCTCCTGCCGACAAGACCAA...
Highlighted segments: atttgctt, ttcact, gcaacct, aactccagt, actca, gcaacct, ccagcgccg, gcaacct
Figure labels: Transcription Factor (TF), TF Binding Motif
A motif can only be discovered computationally when there are enough cases for machine learning

Computational Motif Finding
Input data:
  – Upstream sequences of a gene expression profile cluster
  – sequences, each bps long
Output: enriched sequence patterns (motifs)
Ultimate goals:
  – Which TFs are involved, and what are their binding motifs and effects (enhance / repress gene expression)?
  – Which genes are regulated by this TF, and why is there disease when a TF goes wrong?
  – Are there binding partners or competitors for a TF?

Challenges: Where/what the signal
The motif should be abundant (analogy: Water)
And abundant with significance (analogy: Coconut)
GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATTTACCACCAAATAAGACACGGACGGC
GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA
TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG
CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT

Challenges: Double stranded DNA
The motif appears in both strands
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC
|||||||||||||||||||||||||||||
GTGTAGCGTACCATTTATGGTCAAGTCTG

TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG
|||||||||||||||||||||||||||||
AGAGTCCATTTAGTCAGTATGATGGGTGT
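
A minimal sketch of how a scan can cover both strands: each window is compared with the motif and with the motif's reverse complement. The function names `reverse_complement` and `find_both_strands` are illustrative, not from the slides, and exact matching is assumed for simplicity.

```python
# Sketch: exact-match scan on both strands (names are hypothetical).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_both_strands(seq, motif):
    """Return (position, strand) hits; positions are 0-based on the given strand."""
    rc = reverse_complement(motif)
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc:
            hits.append((i, "-"))
    return hits

# Second sequence of the slide: the motif ATTTACC sits on the reverse strand.
seq = "CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC"
hits = find_both_strands(seq, "ATTTACC")
```

A motif that equals its own reverse complement is reported once, on the "+" strand, in this sketch.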

Challenges: Base substitutions
Sequences do not have to match the motif perfectly; base substitutions are allowed
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATGTACCACCAGTTCAGACACGGACGGC
GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAA
TCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG
CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT

Challenges: Variable motif copies
Some sequences do not have the motif
Some have multiple copies of the motif
GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC
CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC
TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG
GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA
CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT
Analogy: Sushi, Hand Roll, Sashimi, Tempura, Sake (motif: Fish)

Challenges: Two-block motifs
Some motifs have two parts (analogy: Coconut + Milk)
GACACATTTACCTATGC TGGCCCTACGACCTCTCGC
CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC
GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAA
TCTCGTTAGATTTACCACCCA TGGCCGTATCGAGAGCG
CGCTAGCCATTTACCGAT TGGCGTTCTCGAGAATTGCCTAT
or palindromic patterns:
AATGCG
GCGTAA

Scan for Known TF Motif Sites
Experimental TF sites: TRANSFAC, JASPAR
Motif representation:
  – Regular expression:
      Consensus CACAAAA (binary decision)
      Degenerate CRCAAAW (IUPAC: R = A/G, W = A/T)

IUPAC for DNA
Code  Bases    Meaning
A     A        adenosine
C     C        cytidine
G     G        guanine
T     T        thymidine
U     U        uridine
R     G A      purine
Y     T C      pyrimidine
K     G T      keto
M     A C      amino
S     G C      strong
W     A T      weak
B     C G T    not A
D     A G T    not C
H     A C T    not G
V     A C G    not T
N     A C G T  any
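
As a sketch of how a degenerate consensus such as CRCAAAW can be scanned, the IUPAC codes above can be translated into a regular expression. The helper names `iupac_to_regex` and `scan_iupac` are illustrative, the example sequence is made up, and a lookahead is used so that overlapping matches are reported.

```python
import re

# IUPAC DNA codes mapped to regex character classes (U is omitted: DNA only).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(pattern):
    """Translate an IUPAC consensus into a regex string."""
    return "".join(IUPAC[c] for c in pattern.upper())

def scan_iupac(seq, pattern):
    """Return (start, matched_site) for every match, including overlapping ones."""
    rx = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in rx.finditer(seq)]

# Made-up sequence containing two instances of CRCAAAW.
sites = scan_iupac("TTCACAAAAGGCGCAAATCC", "CRCAAAW")
```

This is the "binary decision" of the regular-expression representation: a window either matches or it does not, with no score.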

Scan for Known TF Motif Sites
Experimental TF sites: TRANSFAC, JASPAR
Motif representation:
  – Regular expression:
      Consensus CACAAAA (binary decision)
      Degenerate CRCAAAW
  – Position weight matrix (PWM): needs a score cutoff
Sites used to build the motif matrix:
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
[Table: motif matrix by position (Pos); values not recoverable]
Scoring the segment ATGCAGCT:
$$\text{score} = \frac{P(\text{generate ATGCAGCT from motif matrix})}{P(\text{generate ATGCAGCT from background})}$$
where the denominator is $p_{0A}\, p_{0T}\, p_{0G}\, p_{0C}\, p_{0A}\, p_{0G}\, p_{0C}\, p_{0T}$
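
The following sketch builds a frequency matrix from the ten sites listed on the slide and scores a segment by the log of the likelihood ratio described above. The uniform background, the pseudocount of 1, the cutoff value, and the scanned sequence are assumptions made for illustration; the slide's own matrix values are not recoverable.

```python
import math

BASES = "ACGT"
SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]

def build_pwm(sites, pseudocount=1.0):
    """Per-position base probabilities from aligned sites (pseudocount is an assumption)."""
    pwm = []
    for j in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[j]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def log_odds(segment, pwm, background):
    """log [ P(segment | motif matrix) / P(segment | background) ]."""
    return sum(math.log(pwm[j][b] / background[b]) for j, b in enumerate(segment))

def scan_pwm(seq, pwm, background, cutoff):
    """Report windows whose log-odds score reaches the user-chosen cutoff."""
    w = len(pwm)
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)
            if log_odds(seq[i:i + w], pwm, background) >= cutoff]

background = {b: 0.25 for b in BASES}   # assumption: uniform background
pwm = build_pwm(SITES)
score_site = log_odds("ATGGCATG", pwm, background)   # a known site
score_seg = log_odds("ATGCAGCT", pwm, background)    # the slide's segment
hits = scan_pwm("CCATGGCATGCC", pwm, background, cutoff=5.0)
```

Unlike the regular-expression scan, every window gets a score, so the result depends on the cutoff.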

A Word on Sequence Logo
A sequence logo consists of stacks of symbols, one stack for each position in the sequence
The overall height of the stack indicates the sequence conservation at that position
The height of each symbol within the stack indicates the relative frequency of that nucleotide at that position
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
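
A sketch of the stack heights for the ten sites above, assuming the commonly used information-content definition (2 bits minus the column entropy) and ignoring the small-sample correction that logo software usually applies. The function names are illustrative.

```python
import math
from collections import Counter

SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]

def column_frequencies(sites, j):
    """Relative frequency of each base in column j."""
    counts = Counter(s[j] for s in sites)
    return {b: counts[b] / len(sites) for b in "ACGT"}

def information_content(freqs):
    """Stack height in bits: 2 (maximum for 4 letters) minus the column entropy."""
    return 2.0 + sum(f * math.log2(f) for f in freqs.values() if f > 0)

def logo_heights(sites):
    """Per-column symbol heights: frequency times the column's information content."""
    heights = []
    for j in range(len(sites[0])):
        freqs = column_frequencies(sites, j)
        ic = information_content(freqs)
        heights.append({b: f * ic for b, f in freqs.items()})
    return heights

heights = logo_heights(SITES)
```

A fully conserved column would reach 2 bits, and a column with all four bases equally frequent would have height 0.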

JASPAR
User-defined cutoff to scan for a particular motif

Drawbacks to Known TF Motif Scans
Limited number of motifs
Limited number of sites to represent each motif
  – Low sensitivity and specificity
Poor description of motifs
  – Binding site borders are not clear
  – Binding sites contain many mismatches
Many motifs look very similar
  – E.g. GC-rich motif, E-box (CACGTG)

De Novo Motif Finding

De novo Sequence Motif Finding
Goal: look for common sequence patterns enriched in the input data (compared to the genome background)
Regular expression enumeration
  – Pattern driven approach
  – Enumerate k-mers, check significance in dataset
Position weight matrix update
  – Data driven approach, use data to refine motifs
  – EM & Gibbs sampling
  – Motif score and Markov background

Regular Expression Enumeration
Oligonucleotide analysis: check over-representation for every w-mer:
  – Expected occurrence of w in the data
      Consider genome sequence + current data size
  – Observed occurrence of w in the data
  – An over-represented w is a potential TF binding motif
Compare the observed occurrence of w in the data with the expected occurrence:
$$\text{expected occurrence of } w \text{ in the data} = p_w \times \text{size of sequence data}$$
where $p_w$ comes from the genome background
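
A minimal sketch of oligonucleotide analysis on the five upstream sequences shown earlier in the deck. As a simplifying assumption, $p_w$ is computed from independent single-base background frequencies rather than from real genome counts, and words are ranked by the observed/expected ratio; the function names are illustrative.

```python
from collections import Counter

SEQS = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
    "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
    "TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG",
    "CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT",
]

def count_words(seqs, w):
    """Observed occurrence of every w-mer in the data."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - w + 1):
            counts[s[i:i + w]] += 1
    return counts

def expected_occurrence(word, background, n_windows):
    """p_w times data size, with p_w from an independent-base background (assumption)."""
    p_w = 1.0
    for b in word:
        p_w *= background[b]
    return p_w * n_windows

def over_represented(seqs, w, background, top=5):
    """Rank words by observed / expected occurrence."""
    counts = count_words(seqs, w)
    n_windows = sum(len(s) - w + 1 for s in seqs if len(s) >= w)
    ranked = [(word, obs, obs / expected_occurrence(word, background, n_windows))
              for word, obs in counts.items()]
    ranked.sort(key=lambda t: (-t[2], t[0]))
    return ranked[:top]

background = {b: 0.25 for b in "ACGT"}   # assumption: uniform background
top_words = over_represented(SEQS, 7, background)
```

A real analysis would also attach a significance measure to each ratio, since short data sets give noisy counts.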

Suffix Tree for Fast Search
Weeder, Pavesi & Pesole 2006
Construction is linear in time and space in the length of S
Quickly locates a substring while allowing a certain number of mistakes
Provides the first linear-time solutions for the longest common substring problem
Typically requires significantly more space than storing the string itself

Regular Expression Enumeration
RE enumeration derivatives:
  – oligo-analysis, spaced dyads $w_1.n_s.w_2$
  – IUPAC alphabet
  – Markov background (later)
  – 2-bit encoding, fast index access
  – Enumerate limited RE patterns known for a TF protein structure or interaction theme
Exhaustive, guaranteed to find the global optimum, and can find multiple motifs
Not as flexible with base substitutions, produces a long list of similar good motifs, and is limited in motif width
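
The "2-bit encoding, fast index access" item can be sketched as follows: each base takes two bits, so a k-mer becomes an integer that indexes directly into a count array of size 4^k. The particular bit assignment (A=0, C=1, G=2, T=3) and the function names are assumptions; sequences are assumed to contain only A, C, G, T.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit assignment

def encode(word):
    """Pack a word into an integer, two bits per base."""
    value = 0
    for b in word:
        value = (value << 2) | CODE[b]
    return value

def decode(value, k):
    """Unpack an integer back into a k-mer."""
    return "".join("ACGT"[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))

def count_kmers_2bit(seq, k):
    """Count all k-mers with a rolling 2-bit code used as an array index."""
    counts = [0] * (4 ** k)
    mask = (1 << (2 * k)) - 1       # keeps only the last k bases
    value = 0
    for i, b in enumerate(seq):
        value = ((value << 2) | CODE[b]) & mask
        if i >= k - 1:              # a full window has been read
            counts[value] += 1
    return counts

counts = count_kmers_2bit("ATTTACCATTTACC", 3)
```

The rolling update touches each base once, so counting is linear in the sequence length regardless of k, at the cost of a 4^k array.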

Expectation Maximization and Gibbs Sampling Model
Objects:
  – Seq: sequence data to search for motif
  – $\theta_0$: non-motif (genome background) probability
  – $\theta$: motif probability matrix parameter
  – $\pi$: motif site locations
Problem: $P(\theta, \pi \mid \text{seq}, \theta_0)$
Approach: alternately estimate
  – $\pi$ by $P(\pi \mid \theta, \text{seq}, \theta_0)$
  – $\theta$ by $P(\theta \mid \pi, \text{seq}, \theta_0)$
  – EM and Gibbs differ in the estimation methods

Expectation Maximization
E step: $\pi \mid \theta, \text{seq}, \theta_0$
Sequence: TTGACGACTGCACGT
  TTGAC  p1
  TGACG  p2
  GACGA  p3
  ACGAC  p4
  CGACT  p5
  GACTG  p6
  ACTGC  p7
  CTGCA  p8
  ...
$$p_1 = \text{likelihood ratio} = \frac{P(\text{TTGAC} \mid \theta)}{P(\text{TTGAC} \mid \theta_0)}$$
with denominator $p_{0T} \times p_{0T} \times p_{0G} \times p_{0A} \times p_{0C} = 0.3 \times 0.3 \times 0.2 \times 0.3 \times \ldots$

M step: $\theta \mid \pi, \text{seq}, \theta_0$
  p1 × TTGAC
  p2 × TGACG
  p3 × GACGA
  p4 × ACGAC
  ...
Scale ACGT at each position; $\theta$ reflects the weighted average of $\pi$

M Step
Sequence: TTGACGACTGCACGT
  0.8 × TTGAC
  0.2 × TGACG
  0.6 × GACGA
  0.5 × ACGAC
  0.3 × CGACT
  0.7 × GACTG
  0.4 × ACTGC
  0.1 × CTGCA
  0.9 × TGCAC
  …
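
The E and M steps above can be sketched as below, under several assumptions not stated on the slides: exactly one site per sequence (so window weights are normalized within each sequence), a small pseudocount, a starting matrix seeded from a candidate word, and the background frequencies A = T = 0.3, G = C = 0.2 that appear in the slides' worked numbers. All function names are illustrative.

```python
BASES = "ACGT"
SEQS = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
    "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
    "TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG",
    "CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT",
]

def matrix_from_word(word, strength=0.5):
    """Hypothetical seeding: put `strength` on the word's base, spread the rest evenly."""
    rest = (1.0 - strength) / 3.0
    return [{b: (strength if b == c else rest) for b in BASES} for c in word]

def e_step(seq, motif, bg):
    """Weight of each window: likelihood ratio, normalized within the sequence."""
    w = len(motif)
    ratios = []
    for i in range(len(seq) - w + 1):
        r = 1.0
        for j in range(w):
            r *= motif[j][seq[i + j]] / bg[seq[i + j]]
        ratios.append(r)
    total = sum(ratios)
    return [r / total for r in ratios]

def m_step(seqs, weights, w, pseudocount=0.1):
    """New matrix: weighted base counts per position, rescaled to sum to 1."""
    counts = [{b: pseudocount for b in BASES} for _ in range(w)]
    for seq, ws in zip(seqs, weights):
        for i, p in enumerate(ws):
            for j in range(w):
                counts[j][seq[i + j]] += p
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def run_em(seqs, motif, bg, n_iter=20):
    weights = []
    for _ in range(n_iter):
        weights = [e_step(s, motif, bg) for s in seqs]
        motif = m_step(seqs, weights, len(motif))
    return motif, weights

def consensus(motif):
    return "".join(max(BASES, key=col.get) for col in motif)

bg = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}
motif, weights = run_em(SEQS, matrix_from_word("ATTTACC"), bg)
```

Because EM is deterministic and only reaches a local optimum, the result depends on the starting matrix, which is why several seeds are usually tried.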

EM Derivatives
First EM motif finder (C Lawrence)
  – Deterministic algorithm, guarantees a local optimum
MEME (TL Bailey)
  – Prior probability allows 0-n sites per sequence
  – Runs multiple EMs in parallel with different seeds
  – User-friendly results

Gibbs Sampling
Stochastic process, although it may still need multiple initializations
  – Sample $\pi$ from $P(\pi \mid \theta, \text{seq}, \theta_0)$
  – Sample $\theta$ from $P(\theta \mid \pi, \text{seq}, \theta_0)$
Collapsed form:
  – $\theta$ estimated with counts, not sampled from a Dirichlet
  – Sample the site in one sequence based on the sites from the other sequences
The converged motif matrix $\theta$ and converged motif sites $\pi$ represent the stationary distribution of a Markov chain

Gibbs Sampler: initialization
Randomly initialize a probability matrix
$\theta$ estimated with counts (n: observed base counts at position 1; s: pseudocounts):
$$p_{A1} = \frac{n_{A1} + s_A}{n_{A1} + s_A + n_{C1} + s_C + n_{G1} + s_G + n_{T1} + s_T}$$
[Figure: five sequences with their initial sites and the initial motif matrix]

Gibbs Sampler: remove one sequence
Take out one sequence, with its sites, from the current motif
[Figure: motif matrix built without the first sequence's site]

Gibbs Sampler: score segments
Score each possible segment of this sequence
[Figure: sequence 1 scanned window by window, segment (1-8), then segment (2-9), and so on]

Segment Score
Use the current motif matrix to score a segment
Sites: ATGGCATG, AGGGTGCG, ATCGCATG, TTGCCACG, ATGGTATT, ATTGCACG, AGGGCGTT, ATGACATG, ATGGCATG, ACTGGATG
[Table: motif matrix by position (Pos); values not recoverable]
Segment ATGCAGCT:
$$\text{score} = \frac{P(\text{generate ATGCAGCT from motif matrix})}{P(\text{generate ATGCAGCT from background})}$$
where the denominator is $p_{0A}\, p_{0T}\, p_{0G}\, p_{0C}\, p_{0A}\, p_{0G}\, p_{0C}\, p_{0T}$

Scoring Segments
[Table: motif matrix for positions 1-5 plus a background (bg) column, rows A, T, G, C; only the entries used below are recoverable]
Ignore pseudo counts for now…
Sequence: TTCCATATTAATCAGATTCCG…
Segment  Score
TAATC    …
AATCA    0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 ≈ 0.049
ATCAG    0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 ≈ 11.85
TCAGA    0.2/0.3 x 0.2/0.3 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 ≈ 0.44
CAGAT    …

Gibbs Sampler
Sample the site in one sequence based on the sites from the other sequences
Modified $\theta$ estimated with counts
[Figure: the first sequence's newly sampled site is added back to the motif]

Hill Climbing vs Sampling
[Table: Pos, Score, SubT (running subtotal of the scores); values not recoverable]
Rand(subtotal) = X
Find the first position with subtotal larger than X
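
The two selection rules can be contrasted in a short sketch. It assumes "hill climbing" means always taking the highest-scoring position, which is the usual reading but is not spelled out on the slide; the sampling rule follows the slide's running-subtotal recipe. Function names and the example scores are made up.

```python
import random
from itertools import accumulate

def pick_hill_climbing(scores):
    """Assumed meaning of hill climbing: always take the best-scoring position."""
    return max(range(len(scores)), key=scores.__getitem__)

def pick_by_sampling(scores, rng):
    """Draw X in [0, total) and return the first position whose subtotal exceeds X."""
    subtotals = list(accumulate(scores))
    x = rng.uniform(0, subtotals[-1])
    for pos, sub in enumerate(subtotals):
        if sub > x:
            return pos
    return len(scores) - 1   # guard for the rare case x equals the total

rng = random.Random(0)
scores = [0.0, 3.0, 1.0]          # made-up segment scores
best = pick_hill_climbing(scores)
draws = [pick_by_sampling(scores, rng) for _ in range(4000)]
picked = [draws.count(p) for p in range(len(scores))]
```

Sampling picks each position with probability proportional to its score, so lower-scoring positions are still visited occasionally, which lets the sampler escape poor local optima.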

Gibbs Sampler
Repeat the process until the motif converges
[Figure: the second sequence is taken out next, and its segments are scored with the motif built from the other sequences]
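
Putting the steps together, a collapsed site sampler might look like the sketch below. It assumes one site per sequence, a fixed motif width, a pseudocount of 0.5, and a fixed number of sweeps instead of a convergence test; none of these choices come from the slides, and the function names are illustrative.

```python
import random

BASES = "ACGT"
SEQS = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
    "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
    "TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG",
    "CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT",
]

def motif_without(seqs, sites, w, skip, pseudo=0.5):
    """Motif matrix from counts over all current sites except sequence `skip`."""
    counts = [{b: pseudo for b in BASES} for _ in range(w)]
    for k, (s, pos) in enumerate(zip(seqs, sites)):
        if k == skip:
            continue
        for j in range(w):
            counts[j][s[pos + j]] += 1
    return [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]

def segment_scores(seq, motif, bg):
    """Likelihood ratio (motif vs background) for every window of the sequence."""
    w = len(motif)
    scores = []
    for i in range(len(seq) - w + 1):
        r = 1.0
        for j in range(w):
            r *= motif[j][seq[i + j]] / bg[seq[i + j]]
        scores.append(r)
    return scores

def gibbs_sites(seqs, w, bg, sweeps=100, seed=0):
    """Repeatedly resample each sequence's site given the sites in the others."""
    rng = random.Random(seed)
    sites = [rng.randrange(len(s) - w + 1) for s in seqs]   # random start
    for _ in range(sweeps):
        for k, s in enumerate(seqs):
            motif = motif_without(seqs, sites, w, skip=k)
            scores = segment_scores(s, motif, bg)
            # sample a position with probability proportional to its score
            sites[k] = rng.choices(range(len(scores)), weights=scores)[0]
    return sites

bg = {b: 0.25 for b in BASES}   # assumption: uniform background
sites = gibbs_sites(SEQS, 7, bg)
```

Because the procedure is stochastic, different seeds can end on different or shifted alignments, so several runs are normally compared.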

Gibbs Sampler Intuition
Beginning:
  – Randomly initialized motif
  – No preference towards any segment

Gibbs Sampler Intuition
Motif appears:
  – Motif should have enriched signal (more sites)
  – By chance some correct sites come into alignment
  – Sites bias the motif to attract other similar sites

Gibbs Sampler Intuition
Motif converges:
  – All sites come into alignment
  – Motif is totally biased to sample the same sites every time

11 22 33 44 55 Gibbs Sampler  3i 4i4i  5i  2i  1i Column shift Metropolis algorithm: –Propose  * as  shifted 1 column to left or right –Calculate motif score u(  ) and u(  *) –Accept  * with prob = min(1, u(  *) / u(  )) 50

Summary
Biology and challenge of transcription regulation
Scan for known TF motif sites: TRANSFAC & JASPAR
De novo method
  – Regular expression enumeration
      Oligonucleotide analysis
  – Position weight matrix update
      EM (iterate $\theta$, $\pi$; $\theta$ ~ weighted $\pi$ average)
      Gibbs Sampler (sample $\theta$, $\pi$; Markov chain convergence)