CisGreedy Motif Finder for Cistematic Sarah Aerni Mentors: Ali Mortazavi Barbara Wold.


cisGreedy Motif Finder for Cistematic Sarah Aerni Mentors: Ali Mortazavi Barbara Wold

cisGreedy
Algorithm similar to the Consensus motif finder
– Greedy method over multiple iterations
– De novo motif finder based on input values
Implemented in the Cistematic package using Python
Goal: provide an efficient greedy algorithm for the Cistematic package that performs similarly to Consensus

Cistematic
One motif finder is generally insufficient
Further automated analysis is performed to refine motifs
Enhances motif finder performance through additional steps
Image: Ali Mortazavi

Cistematic
cisGreedy becomes part of the “Bottom Tier”
Offers an alternative to downloading the Consensus software
– Additional motif finders will be made available
Image: Ali Mortazavi

What is a Motif?
cis-Regulatory elements
– Transcription factor binding sites (TFBS)
– Binding by transcription factors may increase or decrease transcription of genes
Gene regulation is believed to be a major source of complexity
– Plants may have more genes or larger genomes than humans; are they more complex?

Multiple Products from One Gene
Other methods to increase complexity
– Polyadenylation: different “endings” available
– Alternative splicing: many more cDNAs
– Methylation
Identification of cis-regulatory elements will help us understand gene regulatory networks

Motif Finding in DNA Sequences
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
5 sample sequences

Motif Finding in DNA Sequences
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
All five sequences contain the 8-mer acgtacgt exactly, but motifs are rarely conserved to such a degree

Motif Finding in DNA Sequences
cctgatagacgctatctggctatccaTgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacTtaGgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgAacgAgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacCtCcgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaaGgtGcgtc
Capital letters mark positions that differ from the fully conserved copies; motifs are less discernible without 100% identity

Motif Finding in DNA Sequences
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttCcaaccat
agtactggtgtAcAtttGatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaAAAtttt
agcctccgatgtaagtcatagctgtaactattacctgccacCcCtAttacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
Other subsequences that are not motifs may appear more conserved, so filtering out noise becomes challenging!
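The fully conserved case can be checked with a few lines of Python. This is only an illustrative sketch, not part of cisGreedy: the function name `shared_kmers` is hypothetical, and it does nothing more than intersect the sets of exact k-mers from the five sample sequences.

```python
def shared_kmers(sequences, k):
    """Return the set of k-mers that occur, exactly, in every sequence."""
    def kmers(seq):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    common = kmers(sequences[0])
    for seq in sequences[1:]:
        common &= kmers(seq)  # keep only k-mers also present in this sequence
    return common


# The five fully conserved sample sequences from the slides.
conserved = [
    "cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat",
    "agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
    "aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt",
    "agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca",
    "ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc",
]

print(shared_kmers(conserved, 8))  # the result includes 'acgtacgt'
```

Exact matching of this kind breaks as soon as a single copy is mutated (the first mutated sequence above no longer contains acgtacgt), which is why motif finders score ungapped alignments rather than look for identical words.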

Motifs are degenerate
Only certain positions need to be specified
– Binding sites for different control elements may overlap, allowing more complex regulation
Often represented by a Position Specific Frequency Matrix (PSFM), in which each nucleotide is given as a fraction and each column sums to 1
Also represented by a “Motif Logo”

How do we find motifs?
Hard to identify
– Relatively short sequences
– Many positions not well conserved
Factors improving identification
– Usually localized within a certain proximity of a gene (search within 3 kb upstream)
– Some positions highly conserved
– Use other data (microarray?)

Motif Finders
Greedy
– Maximizes the similarity of motifs from the sequences through a greedy approach
Gibbs Sampling
– Attempts to find the best motifs using a combination of probability and scores, to avoid settling on local maxima
Expectation Maximization

Consensus Score
Determine the number of occurrences of each base at each position
– The occurrences of the four nucleotides at any position must sum to the total number of sequences included
Identify the most common base at each position to obtain the consensus sequence
Add the occurrences of the consensus base at each position to obtain the consensus score
– Consensus score = 31 in the slide's example
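The counting steps above can be sketched in Python. The five aligned sites are the ones listed on the PSFM slide; treating them as the sites behind the score example is an assumption, although they do reproduce a consensus score of 31. The helper names `count_matrix` and `consensus` are hypothetical.

```python
from collections import Counter

# Aligned sites taken from the PSFM slide (assumed to be the score example too).
sites = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]


def count_matrix(sites):
    """Count the occurrences of each base at each aligned position."""
    length = len(sites[0])
    return [Counter(site[j] for site in sites) for j in range(length)]


def consensus(sites):
    """Return (consensus sequence, consensus score) for equal-length sites."""
    columns = count_matrix(sites)
    # most_common(1) yields the (base, count) pair with the highest count.
    top = [column.most_common(1)[0] for column in columns]
    sequence = "".join(base for base, _ in top)
    score = sum(count for _, count in top)
    return sequence, score


print(consensus(sites))  # ('TGAGGGA', 31)
```

Each column's counts sum to 5, the number of sites, which is the sanity check mentioned on the slide.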

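Position Specific Frequency Matrix
Aligned sites: TGGGGGA, TGAGAGA, TGGGGGA, TGAGAGA, TGAGGGA
Position   1    2    3    4    5    6    7
A         0.0  0.0  0.6  0.0  0.4  0.0  1.0
C         0.0  0.0  0.0  0.0  0.0  0.0  0.0
G         0.0  1.0  0.4  1.0  0.6  1.0  0.0
T         1.0  0.0  0.0  0.0  0.0  0.0  0.0
Frequencies are the number of occurrences of each base at every position divided by the total number of sequences
The sum of each column is 1 (some base must occur at every position)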
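A minimal Python sketch of the same calculation, using the five sites listed on this slide. The function name `psfm` and the dictionary layout (one list of frequencies per base) are illustrative choices, not the Cistematic data structures.

```python
BASES = "ACGT"
sites = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]


def psfm(sites):
    """Position specific frequency matrix: base -> frequency at each position."""
    n = len(sites)
    length = len(sites[0])
    return {
        base: [sum(site[j] == base for site in sites) / n for j in range(length)]
        for base in BASES
    }


matrix = psfm(sites)
for base in BASES:
    print(base, " ".join(f"{freq:.1f}" for freq in matrix[base]))
```

Summing the four rows at any position gives 1, since every site contributes exactly one base to each column.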

Motif Logo
Aligned sites: TGGGGGA, TGAGAGA, TGGGGGA, TGAGAGA, TGAGGGA
Image: bioalgorithms.info
Frequencies affect logo size
– The size of a letter indicates how often that base occurs at the position relative to the other bases
– Size indicates confidence in the letter

Consensus Scoring
Uses an equation similar to a log likelihood, called Information Content
Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics :
$$I = \sum_{j=1}^{L} \sum_{i \in A} f_{ij} \ln \frac{f_{ij}}{p_i}$$
– L: number of columns in the matrix
– A = {A, C, G, T}
– f_{ij}: frequency of each letter i at each position j
– p_i: a priori probability of letter i
Our implementation substitutes the a priori probability with a probability that depends on the preceding nucleotides, taken from the Markov model
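As a rough illustration, information content can be computed as the sum over columns and letters of f_ij * ln(f_ij / p_i), with zero frequencies contributing nothing. The sketch below assumes a fixed background (uniform by default) rather than the Markov-model probabilities the slide says cisGreedy uses; the function name and the `background` parameter are hypothetical.

```python
import math

BASES = "ACGT"


def information_content(sites, background=None):
    """Sum over columns j and letters i of f_ij * ln(f_ij / p_i)."""
    if background is None:
        # Assumption: uniform a priori probabilities when none are supplied.
        background = {base: 0.25 for base in BASES}
    n = len(sites)
    total = 0.0
    for j in range(len(sites[0])):
        for base in BASES:
            f = sum(site[j] == base for site in sites) / n
            if f > 0:  # a letter that never occurs contributes 0
                total += f * math.log(f / background[base])
    return total


sites = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
print(round(information_content(sites), 3))
```

A perfectly conserved column contributes ln 4 under a uniform background, so fully conserved alignments score highest for a given width.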

cisGreedy
Input sequences are analyzed, possibly to establish a background model
– Background models are used to filter out noise
Randomly select 2 sequences to be compared

cisGreedy
The two selected sequences are independently analyzed
Windows of motif size are scanned, starting at the beginning of each sequence

cisGreedy
Sequences are scanned in an attempt to locate the highest scoring alignment
– Alignments are ungapped
– Score is the Information Content

cisGreedy
Reverse complements are analyzed (unless specified otherwise)
Once start locations are established with a top alignment score, they are left unchanged

cisGreedy
Select an additional sequence in which to identify the location of the motif
Windows from the additional sequence are aligned to the previously established windows (hence Greedy)

cisGreedy
The additional sequence is scanned as before, including its reverse complement (unless otherwise specified)
Alignment score established as before

cisGreedy
Final motif locations are used to build position specific frequency matrices
The reverse complement sequence is used in building the PSFM if that strand was the one aligned
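The steps on these slides can be sketched as a short Python function. This is a simplified illustration, not the Cistematic implementation: it assumes a uniform background, a single pass, and no P-value filtering or masking, and every name in it (`greedy_motif`, `windows`, `info_content`, the `rng` parameter) is hypothetical.

```python
import math
import random

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def info_content(sites, p=0.25):
    """Information content of an ungapped alignment (uniform background assumed)."""
    n, total = len(sites), 0.0
    for j in range(len(sites[0])):
        for base in BASES:
            f = sum(site[j] == base for site in sites) / n
            if f > 0:
                total += f * math.log(f / p)
    return total


def windows(seq, width, use_rc=True):
    """Yield every window of the given width and, optionally, its reverse complement."""
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        yield window
        if use_rc:
            yield reverse_complement(window)


def greedy_motif(sequences, width, use_rc=True, rng=random):
    """Fix the best window pair from two founders, then add one sequence at a time."""
    order = list(sequences)
    rng.shuffle(order)  # random choice of the two founder sequences
    first, second, rest = order[0], order[1], order[2:]
    # Highest scoring ungapped alignment between the two founders.
    chosen = max(
        ([a, b]
         for a in windows(first, width, use_rc=False)
         for b in windows(second, width, use_rc)),
        key=info_content,
    )
    # Founder windows stay fixed; each further sequence adds its best window.
    for seq in rest:
        best = max(windows(seq, width, use_rc),
                   key=lambda w: info_content(chosen + [w]))
        chosen.append(best)
    return chosen


toy = ["CCCCTGACTCACCCC", "GGTGACTCAGGGGGG", "TGACTCATTTTTTTT"]
print(greedy_motif(toy, 7, use_rc=False, rng=random.Random(0)))
```

The returned windows are the aligned sites from which a PSFM would be built. Because earlier choices are never revisited, a poor founder pair can lock in a weak motif, which is why multiple iterations are run before a motif is selected.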

cisGreedy User Input
Sequence input
Motif size (may be a range)
Number of motifs cisGreedy should find
Iterations to perform at each step before selecting a motif
Background model
Markov model size
Reverse complement (whether to include it)
May designate which sequences will be “founder” sequences (select homologs)
Designate percent identity between founder sequences

cisGreedy Output
Multiple motifs represented as PWMs or PSFMs
Motifs represented as symbols
– Basic nucleotides are represented by their respective symbols (A for adenine, etc.)
– Remaining symbols may require a threshold
Nucleotides  Symbol
AC           M
AG           R
AT           W
CG           S
CT           Y
GT           K
ACG          V
ACT          H
AGT          D
CGT          B
ACGT         N

Symbol Example
Frequency matrix with rows A, C, G, T (values not preserved in this transcript)
Resulting symbols: RRSAGMGASSA
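One possible way to turn a PSFM into such a symbol string is sketched below in Python. The threshold rule (keep every base whose frequency is at least 0.25) and the function name `to_symbols` are assumptions made for illustration; the slides do not state which threshold cisGreedy applies. The input is the PSFM of the five TGGGGGA-type sites from the earlier slide.

```python
# Nucleotide sets and their symbols, as in the table on the output slide.
SYMBOLS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}


def to_symbols(psfm, threshold=0.25):
    """Map each PSFM column to the symbol for the bases at or above the threshold."""
    length = len(psfm["A"])
    symbols = []
    for j in range(length):
        kept = "".join(base for base in "ACGT" if psfm[base][j] >= threshold)
        if not kept:
            # With a high threshold no base may qualify; fall back to the top base.
            kept = max("ACGT", key=lambda base: psfm[base][j])
        symbols.append(SYMBOLS[kept])
    return "".join(symbols)


example = {
    "A": [0.0, 0.0, 0.6, 0.0, 0.4, 0.0, 1.0],
    "C": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "G": [0.0, 1.0, 0.4, 1.0, 0.6, 1.0, 0.0],
    "T": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}
print(to_symbols(example))  # TGRGRGA
```

Raising the threshold collapses mixed columns to a single base; lowering it produces more degenerate symbols.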

cisGreedy - Optimization
ZOOPS (Zero or One Occurrence Per Sequence)
– If no good motif is identified in a sequence, the sequence is removed
– This applies if the subsequence's P-value is not greater than the average P-value
Background model (default: Markov3)
– Can be supplied as input (e.g., C/G-rich regions)
– The Markov model can be up to Markov6 (unreasonable for input sequences of a certain size)
Find multiple motifs
– Mask each motif after identification (windows cannot be reused)
Allow for ranges of motif lengths
Perform multiple iterations before choosing a motif
– Avoids local maxima

cisGreedy Markov3 Background Model
Collection of all 4-mers with the corresponding frequency of each word in the input sequences
Use the 4-mer frequencies to describe the P-value of the last nucleotide in the 4-mer
– Nucleotide P-values are not independent
The probability of any sequence is the product of the probabilities of the nucleotides that make up that sequence
A word is deemed significant if its probability is less than the average over all words of the same size in the background model

cisGreedy Markov3 - Example
Each word has a probability associated with it
– Probability of seeing the word based on its frequency in the model
– Describes the probability of seeing the letter in the last position given the 3-mer preceding it
5.3 * 10^-8
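A Markov3 background of this kind can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Cistematic code: the first three bases of a word are not scored because they lack a full 3-base context, unseen 4-mers get a small floor probability, and the names `build_markov3`, `word_probability`, and `floor` are hypothetical.

```python
import math
from collections import Counter


def build_markov3(sequences):
    """Count every 4-mer, then convert counts to P(last base | preceding 3-mer)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 3):
            counts[seq[i:i + 4]] += 1
    context_totals = Counter()
    for word, count in counts.items():
        context_totals[word[:3]] += count
    return {word: count / context_totals[word[:3]] for word, count in counts.items()}


def word_probability(word, model, floor=1e-6):
    """Product of the conditional probabilities of each base with a full context."""
    probability = 1.0
    for i in range(3, len(word)):
        # Assumption: 4-mers never seen in the background get a small floor value.
        probability *= model.get(word[i - 3:i + 1], floor)
    return probability


model = build_markov3(["AAAAAAAC"])
p = word_probability("AAAAC", model)
print(p, -math.log(p))  # probability of the word and its -ln value
```

In this toy background the context AAA is followed by A four times and by C once, so the word AAAAC has probability 0.8 * 0.2; the plots on the following slides show the same -ln quantity along a real upstream region.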

Calculating probability of a word
Calculation of word probability based on Markov models
Plot: -ln probability of each subsequence along the sequence

1kb upstream region of a yeast gene: nucleotide distribution
Distribution of nucleotide probabilities based on the Markov model
Plot: -ln probability of nucleotide against sequence position (upstream from the transcription start site)

1kb upstream region of a yeast gene: word probabilities
-ln probabilities of all words based on the Markov model
Plot: -ln probability of sequence against sequence position (upstream from the transcription start site)

1kb upstream region of a yeast gene: motif probability
-ln probabilities, based on the Markov model, of all words within 50 nucleotides of a known yeast motif
Plot: -ln probability of sequence against sequence position (upstream from the transcription start site)

Probability of a motif
The probability of seeing a motif given a background should be lower
– The chance of seeing the word at random should be low
A motif will not have an extremely low probability, as it must be seen multiple times in a data set to be identified

MSP Results
Testing using nematode data
– C. elegans and C. briggsae
Major Sperm Protein (MSP)
– Cytoskeletal element required for motility of nematode spermatozoa
– Multiple genes in the genomes
– Co-regulated

MSP Results – MEME motifs
Motifs represented by symbols, as identified by MEME

MSP Results – cisGreedy motifs
Motifs represented by symbols, as identified by cisGreedy

MSP Results – MEME motifs Motifs identified by MEME plotted on input sequences – Total 10 motifs identified (not all plotted)

MSP Results – cisGreedy motifs Motifs identified by cisGreedy plotted on input sequences – Total 10 motifs identified (not all plotted)

Future goals
Test cisGreedy with the dataset used in the paper analyzing available motif finding tools
Make adjustments to improve results
Build upon cisGreedy to make more complex algorithms (Weeder?)
Additional motif finders based on different theories
– Gibbs Sampler
– Expectation Maximization

References
Bioalgorithms.info
Jones, Neil C., and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. : MIT Press,
Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics :
Tompa, Martin et al. "Assessing computational tools for the discovery of transcription factor binding sites." Nature Biotechnology January 2005:

Acknowledgements
Ali Mortazavi
Barbara Wold
Wold Lab funding provided by DOE & NASA
Additional funding by NSF & NIH
SoCalBSI faculty, staff and fellow students