Yueyi Irene Liu CS374 Lecture Oct. 17, 2002


Motif Finding. Yueyi Irene Liu, CS374 Lecture, Oct. 17, 2002

Outline: background biology; motif-finding methods (word enumeration, Gibbs sampling, random projection, phylogenetic footprinting, Reducer)

Regulation of Gene Expression: chromatin structure; transcription initiation; transcript processing and modification; RNA transport; transcript stability; translation initiation; post-translational modification; protein transport; control of protein stability

Typical Structure of a Eukaryotic mRNA Gene

Control of Transcription Initiation

Motif: a conserved pattern that is found in two or more sequences. Can be found in DNA (e.g., transcription factor binding sites), protein, and RNA.

Models for Representing Motifs
Regular expression: consensus TGACGCA; degenerate WGACRCA
Position-specific matrix, from sites such as TGACGCA, AGACGCA, TGACACA:
       1    2    3    4    5    6    7
  A   0.4                 0.2
  T   0.6
  G                       0.8
  C
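A minimal sketch of how a position-specific matrix can be computed from aligned sites. It uses only the three sites listed on the slide, so the frequencies it produces (thirds) will not match the 0.4/0.6/0.8 values shown, which presumably came from a larger set of sites. The function name `position_frequency_matrix` is hypothetical.

```python
from collections import Counter

def position_frequency_matrix(sites, alphabet="ATGC"):
    """Return {base: [frequency at each position]} for equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = {base: [] for base in alphabet}
    # zip(*sites) walks the alignment column by column.
    for column in zip(*sites):
        counts = Counter(column)
        for base in alphabet:
            matrix[base].append(counts[base] / len(sites))
    return matrix

# The three example sites from the slide.
sites = ["TGACGCA", "AGACGCA", "TGACACA"]
pfm = position_frequency_matrix(sites)
for base in "ATGC":
    print(base, " ".join(f"{value:.2f}" for value in pfm[base]))
```

Positions 1 and 5 are the only columns with more than one base, which is what the degenerate consensus WGACRCA expresses (W = A or T, R = A or G).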

Where to look for motifs? In gene families: sets of genes controlled by a common transcription factor or a common environmental stimulus. How do you construct gene families? Microarray experiments.

Microarrays
[Figure: mRNA is isolated from the cells of interest and from a reference sample, then hybridized to known DNA sequences on a glass slide. The resulting data form a genes x experiments matrix; sample values: 3.25 3.01 1.30 0.70 6.73 2.89 0.92 0.67 1.14 1.15 0.60 0.23 2.12 6.12 0.07 0.02]
Protein levels can be estimated by measuring mRNA, a DNA-like molecule.
Labeled mRNA (cDNA) selectively hybridizes to the matching DNA sequence on the slide.
Quantitate the data and represent it as a matrix, although it is usually displayed in terms of colors.
Results are usually shown as a matrix of ratios (sample/reference): experiments appear in columns, genes appear in rows.
Sizes vary; 10,000 genes x 30 experiments is reasonable.

Motif-finding Methods. Goal: look for motifs (5-15 bp) in the data set. Methods: word enumeration; Gibbs sampling; random projection; phylogenetic footprinting; Reducer.

Word Enumeration
For every word w, calculate:
Expected frequency, based on the entire upstream region of the yeast genome. E.g., P(ATTGA) = (0.4)^4 (0.1)^1, given P(A) = P(T) = 0.4 and P(G) = P(C) = 0.1. Expected number of occurrences of ATTGA: E = n * P(ATTGA).
Observed frequency O in the data set.
Statistical significance of enrichment: Z = (O - E) / sqrt[n p (1 - p)] ~ N(0, 1), where p is the probability of the word.
Disadvantage: only exact words are considered. E.g., the degenerate motif YCTGCA is counted as two separate words, TCTGCA and CCTGCA.
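The calculation above can be sketched as follows. The base probabilities and the word ATTGA come from the slide; the observed count (12) and the number of positions n (1000) are made-up values for illustration, and the function names are hypothetical.

```python
import math

def word_probability(word, base_probs):
    """Probability of a word under an independent-base background model."""
    p = 1.0
    for base in word:
        p *= base_probs[base]
    return p

def enrichment_z(observed, n_positions, p):
    """Z = (O - E) / sqrt(n p (1 - p)), with E = n * p."""
    expected = n_positions * p
    return (observed - expected) / math.sqrt(n_positions * p * (1 - p))

background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}
p = word_probability("ATTGA", background)  # (0.4)^4 * (0.1)^1 = 0.00256

# Hypothetical data set: 1000 word positions, ATTGA observed 12 times.
z = enrichment_z(observed=12, n_positions=1000, p=p)
print(f"P(ATTGA) = {p:.5f}, Z = {z:.2f}")
```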

Gibbs Sampling: use a matrix to capture a motif. Goal: find the best start positions a1, a2, ..., ak (one per sequence) that maximize the difference between the motif and the background base distribution. (Liu, X)

Gibbs Sampling (Lawrence, et al, 1993)
Step 1: Pick random start positions and compute the current motif matrix.
Step 2: Iterative update. Take one sequence out and update the motif matrix; calculate the fitness score of each position of the left-out sequence; pick a start position in the left-out sequence based on the weight Ax; take out another sequence, and so on, until convergence.
Step 3: Reset the starting positions.
(Liu, X)

Gibbs Sampling Initialization: pick random start positions a1', a2', ..., ak' and compute the motif matrix. (Liu, X)

Gibbs Sampling Iteration Steps 1) Take out one sequence and calculate the fitness score of every subsequence relative to the current motif. (Liu, X)

Fitness Score: Ax = Qx / Px
Qx: probability of generating subsequence x from the current motif
Px: probability of generating subsequence x from the background
Current motif matrix (positions 1-3; entries as extracted): A 0.1 0.3 0.7, T 0.2, G 0.4, C
Background: P(A) = P(T) = 0.4, P(G) = P(C) = 0.1
For x = GGA: Q? P?
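A small worked sketch of the fitness score for x = GGA. The background probabilities are the slide's; the motif matrix on the slide is incomplete in this transcript, so the `motif` below is a hypothetical matrix (each column sums to 1) chosen only to make the arithmetic concrete.

```python
def prob_under_motif(x, motif):
    """Qx: product of the motif matrix entries for each base of x."""
    q = 1.0
    for position, base in enumerate(x):
        q *= motif[base][position]
    return q

def prob_under_background(x, background):
    """Px: product of background probabilities for each base of x."""
    p = 1.0
    for base in x:
        p *= background[base]
    return p

def fitness(x, motif, background):
    """Ax = Qx / Px."""
    return prob_under_motif(x, motif) / prob_under_background(x, background)

background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

# Hypothetical 3-column motif matrix (not the slide's full matrix).
motif = {
    "A": [0.1, 0.3, 0.7],
    "T": [0.2, 0.1, 0.1],
    "G": [0.4, 0.5, 0.1],
    "C": [0.3, 0.1, 0.1],
}

q = prob_under_motif("GGA", motif)            # 0.4 * 0.5 * 0.7
p = prob_under_background("GGA", background)  # 0.1 * 0.1 * 0.4
print(q, p, fitness("GGA", motif, background))
```

With these assumed numbers, Q is 0.14 and P is 0.004, so the subsequence is about 35 times more likely under the motif than under the background.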

Gibbs Sampling Iteration Steps 2) Pick a new start position ak' by sampling in proportion to the fitness score. (Liu, X)
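The initialization and iteration steps can be put together in a very basic site sampler. This is a sketch under several assumptions, not the Lawrence et al. implementation: it expects exactly one site per sequence, uses a pseudocount of 1, runs a fixed number of sweeps instead of testing for convergence, and omits the restart of Step 3. All function and parameter names are hypothetical.

```python
import random

BASES = "ACGT"

def build_matrix(sites, pseudocount=1.0):
    """Column-wise base frequencies of the given sites, with pseudocounts."""
    matrix = []
    for column in zip(*sites):
        counts = {base: pseudocount for base in BASES}
        for base in column:
            counts[base] += 1
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in BASES})
    return matrix

def fitness(x, matrix, background):
    """Ax = Qx / Px for subsequence x."""
    score = 1.0
    for position, base in enumerate(x):
        score *= matrix[position][base] / background[base]
    return score

def gibbs_sample(sequences, width, background, iterations=200, seed=0):
    rng = random.Random(seed)
    # Step 1: random start position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        for i, seq in enumerate(sequences):
            # Take sequence i out; rebuild the matrix from the other sites.
            others = [s[a:a + width]
                      for j, (s, a) in enumerate(zip(sequences, starts)) if j != i]
            matrix = build_matrix(others)
            # Score every position of the left-out sequence.
            positions = list(range(len(seq) - width + 1))
            weights = [fitness(seq[a:a + width], matrix, background)
                       for a in positions]
            # Sample the new start position in proportion to the fitness score.
            starts[i] = rng.choices(positions, weights=weights)[0]
    return starts

# Toy input (made up): each sequence contains a copy of TGACGCA.
toy = ["ACGTTGACGCATT", "TTGACGCAGGCTA", "GGCATGACGCAAC"]
uniform = {base: 0.25 for base in BASES}
found = gibbs_sample(toy, width=7, background=uniform, iterations=50, seed=1)
print([s[a:a + 7] for s, a in zip(toy, found)])
```

Because the new position is sampled rather than chosen greedily, the sampler can escape poor starting positions, but different seeds may give different answers on small inputs.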

Recent Developments: random projection; phylogenetic footprinting; Reducer

Random Projection (Buhler, 2002)
(l, d)-motif problem: M is an (unknown) motif of length l. Each occurrence of M is corrupted by exactly d point substitutions in random positions.
No known biological motif is strictly an (l, d)-motif.
Example instances: CCcaAG CCcgAG CCgcAG CCtaAG CCtgAG CtATgG CCctAc tCtTAG CaAcAG CCAgAa

Random Projection Algorithm
Guiding principle: some instances of a motif agree on a subset of positions. Use information from multiple motif instances to construct a model.
Example: (7,2)-motif M = ATGCGTC, with instances x(1), x(2), x(5), x(8): ...ccATCCGACca... ...ttATGAGGCtc... ...ctATAAGTCgc... ...tcATGTGACac...
(Buhler, J)

k-Projections
Choose k positions in a string of length l. Concatenate the nucleotides at the chosen k positions to form a k-tuple. In l-dimensional Hamming space, this is a projection onto a k-dimensional subspace.
Example: l = 15, k = 7, P = (2, 4, 5, 7, 11, 12, 13): ATGGCATTCAGATTC projects to TGCTGAT.
(Buhler, J)
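A short sketch of the projection, reproducing the slide's example. Positions are 1-based as on the slide; the function names `project` and `random_projection` are hypothetical.

```python
import random

def project(l_mer, positions):
    """Concatenate the letters of l_mer at the given 1-based positions."""
    return "".join(l_mer[p - 1] for p in positions)

def random_projection(l, k, rng=random):
    """Choose k of the l positions uniformly at random, in increasing order."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

# The slide's example: l = 15, k = 7.
P = (2, 4, 5, 7, 11, 12, 13)
print(project("ATGGCATTCAGATTC", P))  # TGCTGAT
```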

Random Projection Algorithm
Choose a projection by selecting k positions uniformly at random.
For each l-tuple in the input sequences, hash it into a bucket based on the letters at the k selected positions.
Recover the motif from a bucket containing multiple l-tuples.
Example: from input sequence x(i) = …TCAATGCACCTAT..., the l-tuple TGCACCT hashes to bucket TGCT.
(Buhler, J)

Example: l = 7 (motif size), k = 4 (projection size). Choose projection (1, 2, 5, 7). Input sequence: ...TAGACATCCGACTTGCCTTACTAC... The l-tuple ATCCGAC goes to bucket ATGC; the l-tuple GCCTTAC goes to bucket GCTC. (Buhler, J)

Hashing and Buckets
The hash function h(x) is obtained from the k positions of the projection. Buckets are labeled by values of h(x), e.g., ATTC, CATC, GCTC, ATGC.
Enriched buckets: those that contain more than s l-tuples, for some parameter s.
(Buhler, J)
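The hashing step can be sketched with a dictionary keyed by the projected k-tuple. The input below is the visible part of the slide's example sequence with projection (1, 2, 5, 7); the function names are hypothetical, and the sketch only does the bucketing, not the later refinement.

```python
from collections import defaultdict

def hash_l_mers(sequences, l, positions):
    """Map each projected k-tuple h(x) to the list of l-tuples x that hash to it."""
    buckets = defaultdict(list)
    for seq in sequences:
        for start in range(len(seq) - l + 1):
            l_mer = seq[start:start + l]
            key = "".join(l_mer[p - 1] for p in positions)  # 1-based positions
            buckets[key].append(l_mer)
    return buckets

def enriched_buckets(buckets, s):
    """Keep the buckets that contain more than s l-tuples."""
    return {key: members for key, members in buckets.items() if len(members) > s}

buckets = hash_l_mers(["TAGACATCCGACTTGCCTTACTAC"], l=7, positions=(1, 2, 5, 7))
print(buckets["ATGC"])  # contains ATCCGAC
print(buckets["GCTC"])  # contains GCCTTAC
```

With a single short sequence no bucket is enriched; enrichment only becomes informative when many sequences containing instances of the same motif are hashed together.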

Motif Refinement
How do we recover the motif from the sequences in the enriched buckets? The k nucleotides at the projected positions are known from the hash value of the bucket. Use the information in the other l - k positions as the starting point for a local refinement scheme, e.g., EM or a Gibbs sampler.
Example: bucket ATGC containing ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC; the local refinement algorithm yields the candidate motif ATGCGTC.
(Buhler, J)

Parameter Selection
Projection size k: choose k small so that several motif instances hash to the same bucket (k < l - d). Choose k large to avoid contamination by spurious l-mers (4^k > t(n - l + 1)).
Bucket threshold s: s = 3 or s = 4.
(Buhler, J)
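The two constraints on k can be checked directly. The slide does not define t and n; the sketch assumes t is the number of input sequences and n the length of each, so that t(n - l + 1) is the total number of l-mers. The example values (l = 15, d = 4, t = 20, n = 600) are illustrative, not taken from the slide.

```python
def feasible_projection_sizes(l, d, t, n):
    """Return every k with k < l - d and 4^k > t * (n - l + 1)."""
    total_l_mers = t * (n - l + 1)
    return [k for k in range(1, l - d) if 4 ** k > total_l_mers]

# Assumed example: 20 sequences of length 600, searching for a (15, 4)-motif.
sizes = feasible_projection_sizes(l=15, d=4, t=20, n=600)
print(sizes)
```

Here there are 20 * 586 = 11,720 l-mers, so k must be at least 7 (4^7 = 16,384) and at most 10 (k < 11).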

Recent Developments: random projection; phylogenetic footprinting; Reducer

Conservation of Regulatory Elements Upstream of the ApoAI Gene
[Figure: upstream sequences of mouse, rabbit, human, and chicken, showing the conserved hepatic site C, CCAAT box, and TATA box]

AAGCA ACGCA

Substring Parsimony Problem
Given: orthologous upstream sequences S1, ..., Sn; the phylogenetic tree T of the n species; the size k of the motif and a threshold d.
Problem: find all sets of substrings s1, ..., sn of S1, ..., Sn, each of size k, such that the parsimony score of s1, ..., sn on T is at most d.
(Blanchette, M)

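Parsimony Score
[Figure: tree T with leaves labeled by the substrings s1, ..., s6]
Parsimony score = minimum, over all possible labelings of the internal nodes, of the sum over the edges (u, v) of T of d(l(u), l(v)), where l(v) is the label of node v and d(l1, l2) is the Hamming distance.
(Blanchette, M)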

Substring Parsimony Problem: example with k = 5, d = 1
S1: AAAGCATTC
S2: TACGCACCC
S3: GAAGCAGGG
Substrings found: AAGCA (in S1 and S3) and ACGCA (in S2)
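The example can be checked by brute force. This sketch is not the dynamic-programming algorithm from the slides: it enumerates every combination of k-mers and, because the slide's tree is not recoverable here, it assumes a star tree (one internal node joined to all three leaves). On a star tree the best internal label is the per-column majority base, so the parsimony score is the number of leaf letters that disagree with it. Function names are hypothetical.

```python
from itertools import product

def windows(seq, k):
    """All substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def star_parsimony(substrings):
    """Parsimony score on a star tree (assumed topology)."""
    score = 0
    for column in zip(*substrings):
        most_common = max(column.count(base) for base in set(column))
        score += len(column) - most_common
    return score

def substring_parsimony(sequences, k, d):
    """All choices of one k-mer per sequence with parsimony score at most d."""
    hits = []
    for combo in product(*(windows(s, k) for s in sequences)):
        score = star_parsimony(combo)
        if score <= d:
            hits.append((combo, score))
    return hits

hits = substring_parsimony(["AAAGCATTC", "TACGCACCC", "GAAGCAGGG"], k=5, d=1)
for combo, score in hits:
    print(combo, score)
```

The enumeration grows exponentially with the number of sequences, which is the cost the table-based algorithms on the following slides are designed to avoid.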

Algorithm: version I
Root the tree at an arbitrary internal node r.
Compute a table Wu of size 4^k for each node u, where Wu[s] is the best parsimony score for the subtree rooted at u when u is labeled with s: Wu[s] = sum, over the children v of u, of the minimum over t of (Wv[t] + d(s, t)).
Direct implementation of this recursion gives O(n∙k∙(4^(2k) + l)), where l is the average sequence length.
(Blanchette, M)

Algorithm: version II
For a node u labeled s with children v and w, define X(u, v)[s] as the best parsimony score for the subtree consisting of the edge (u, v) and the subtree rooted at v.
(Blanchette, M)

Algorithm: version II (continued)
Update X(u, v) in phases: in phase p, maintain the set Bp of sequences t such that X(u, v)[t] = p.
Define: Ra = {s: Wv[s] = a} and N(s) = {t in Σ^k: d(s, t) = 1}.
Start in phase m and let Bm = Rm, then update Bp phase by phase.
Computation of X(u, v) takes O(k∙4^k).
(Blanchette, M)

Improvements
Reduce the size of Bp when sequences contribute to X(u, v) greater than threshold d. In phase p, only care for sequence X(u, v) [s] if
Leads to significant reductions in stages d/2 … d.
Reduce the number of substrings inserted in W at the leaves: for a substring s of Si, if its best match against any Sj has Hamming distance at least d, s can be discarded.
(Blanchette, M)

Results
Practical limit: k = 10.
There appeared to be a threshold d0 with very few solutions below it and many above.
The algorithm found ~80% of known binding sites.
It performed better than ClustalW, MEME, and Consensus.
(Blanchette, M)

Recent Developments: random projection; phylogenetic footprinting; Reducer

Reducer (Bussemaker, et al 2001)
Links motif finding to expression level: Ag = C + Σ Fu Nug, summed over motifs u = 1, ..., M
Ag: expression level of gene g (logarithm of the expression ratio)
M: number of significant motifs
Nug: number of occurrences of motif u in gene g
C: baseline expression level (same for all genes)
Fu: increase/decrease of expression level caused by the presence of motif u

Reducer (Cont’d)
Expression vector: log ratio of expression levels for Gene1, Gene2, Gene3, Gene4, …, GeneN, e.g., 1.3 -3.7 10.3 4.5 -2.3
Motif vector: number of times that the motif occurs in the upstream region of each gene, e.g., AAAAA: 2 5 3; AAAAT: 1
(Liu, X)

Reducer (Cont’d)
Normalize the expression (A) and motif (n) vectors.
Run a linear regression between the A vector and every n vector to find the n that best fits A.
Use step-wise regression to combine the effects of motifs: subtract the effect of one motif, then find the next best motif.
(Liu, X)
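The steps above can be sketched in plain Python. This is an illustration of the normalize / regress / subtract loop, not the Reducer program itself. The expression values are the five numbers shown on the previous slide; the motif counts are hypothetical because the transcript only preserves a few of them. Function names are made up.

```python
import math

def normalize(vector):
    """Center a vector and scale it to unit length (all zeros if constant)."""
    mean = sum(vector) / len(vector)
    centered = [x - mean for x in vector]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered] if norm > 0 else centered

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def stepwise_motifs(expression, motif_counts, steps):
    """Greedily pick the motif that best fits the residual, then subtract its effect."""
    residual = normalize(expression)
    candidates = {m: normalize(counts) for m, counts in motif_counts.items()}
    chosen = []
    for _ in range(min(steps, len(candidates))):
        # For centered unit-length vectors the dot product is the correlation
        # and also the least-squares slope F of residual on n.
        best = max(candidates, key=lambda m: abs(dot(residual, candidates[m])))
        n = candidates.pop(best)
        slope = dot(residual, n)
        residual = [r - slope * x for r, x in zip(residual, n)]
        chosen.append((best, slope))
    return chosen

expression = [1.3, -3.7, 10.3, 4.5, -2.3]  # log ratios from the slide
motif_counts = {                            # hypothetical counts per gene
    "AAAAA": [2, 0, 5, 3, 0],
    "AAAAT": [0, 1, 0, 0, 0],
}
chosen = stepwise_motifs(expression, motif_counts, steps=2)
for motif, slope in chosen:
    print(motif, round(slope, 3))
```

A positive slope suggests that the motif is associated with increased expression and a negative slope with decreased expression; a real analysis would also test each fit for statistical significance.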

Acknowledgement. People from whom I borrowed slides: Xiaole Liu (Reducer), Olga Troyanskaya (Microarray), Jeremy Buhler (Random projections), Mathieu Blanchette (Phylogenetic footprinting), and various web sources.

[Figure: microarray workflow. cDNA clones (probes); PCR product amplification and purification; printing onto the microarray (0.1 nl/spot); hybridise the mRNA target to the microarray; scanning with laser 1 and laser 2 (excitation, emission); overlay images and normalise; analysis]

Information Content of Motifs. Uncertainty: H. Information = Hbefore - Hafter
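A brief sketch of the calculation for one motif column, assuming the uncertainty H is the Shannon entropy in bits and that Hbefore is the entropy of a uniform background over the four bases. Both are assumptions, since the slide's formula for H is not preserved in this transcript; the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def column_information(column_probs, background_probs=(0.25, 0.25, 0.25, 0.25)):
    """Information = H_before - H_after for one motif position."""
    return entropy(background_probs) - entropy(column_probs)

print(column_information([1.0, 0.0, 0.0, 0.0]))      # fully conserved position
print(column_information([0.25, 0.25, 0.25, 0.25]))  # uninformative position
print(column_information([0.4, 0.6, 0.0, 0.0]))      # partially conserved position
```

Under these assumptions a fully conserved DNA position carries 2 bits and a uniformly random position carries 0 bits; a sequence logo draws each column with a height equal to this value.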

Improvements on the Original Gibbs Sampler: allow 0 to n copies of a site in each sequence; use iterative masking to find multiple motifs; use higher-order Markov models to improve motif specificity.

Clinical Importance of Defects in Regulatory Elements Burkitt’s Lymphoma

Statistical Methods: Expectation Maximization (EM); Gibbs sampling; MEME; BioProspector; AlignACE

Motifs are not limited to DNA
RNA motifs: RNA-RNA interaction motifs, e.g., intron-exon splice sites; RNA-protein interaction motifs, e.g., binding of proteins to the RNA polyA tail
Protein motifs: e.g., the helix-turn-helix motif

Sequence Logo

Why is this Problem Hard? The information content of motifs is low, and the Hamming distance between motif instances is high.