Motif finding: Lecture 1 CS 498 CXZ

From DNA to Protein: In words
1. DNA = nucleotide sequence. Alphabet size = 4 (A, C, G, T)
2. DNA → mRNA (single stranded). Alphabet size = 4 (A, C, G, U)
3. mRNA → amino acid sequence. Alphabet size = 20
4. Amino acid sequence “folds” into a 3-dimensional molecule called a protein
Example: AATACGAAGTAA → AAUACGAAGUAA → Asn Thr Lys Stop
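The DNA → mRNA → amino acid example on this slide can be reproduced with a short sketch. It assumes the standard genetic code and, like the slide, treats the given DNA string as having the same sequence as the mRNA (T replaced by U). The names `transcribe` and `translate` are illustrative only.

```python
# Standard genetic code, with codons ordered by bases U, C, A, G at each position.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def transcribe(dna):
    """Write the mRNA with the same sequence as the given DNA string (T -> U)."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """Translate codon by codon into one-letter amino acids; '*' marks a stop codon."""
    codons = (mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

mrna = transcribe("AATACGAAGTAA")
print(mrna)             # AAUACGAAGUAA
print(translate(mrna))  # NTK*  (Asn Thr Lys Stop)
```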

Gene expression
Process of making a protein from a gene as template
Transcription, then translation
Can be regulated

Transcription
Process of making a single-stranded mRNA using double-stranded DNA as template
Only genes are transcribed, not all DNA

Step 1: From DNA to mRNA
Transcription [figure]

Transcriptional regulation
[Figure labels: GENE, ACAGTGA, TRANSCRIPTION FACTOR, PROTEIN]

The importance of gene regulation

Genetic regulatory network controlling the development of the body plan of the sea urchin embryo Davidson et al., Science, 295(5560):

That was the “circuit” responsible for development of the sea urchin embryo
Nodes = genes
Switches = gene regulation
Change the switches and the circuit changes
Gene regulation significance:
  – Development of an organism
  – Functioning of the organism
  – Evolution of organisms

Binding sites and motifs

Binding sites
Binding sites of transcription factor “Bicoid”, collected experimentally

Motif (“Consensus String”): TAATCCC

Motif: WAATCCN
W = T or A; N = A, C, G, or T

Motif
Common sequence “pattern” in the binding sites of a transcription factor
A succinct way of capturing variability among the binding sites

Alternative way to represent motif
Position weight matrix (PWM), or simply “weight matrix”
[Figure: weight matrix with rows A, C, G, T]

Motif representation
Consensus string
  – May allow “degenerate” symbols in the string, e.g., N = A/C/G/T; W = A/T; S = C/G; R = A/G; Y = T/C, etc.
Position weight matrix
  – More powerful representation
  – Probabilistic treatment

The motif finding problem
Suppose a transcription factor (TF) controls five different genes
Each of the five genes should have binding sites for TF in its promoter region
[Figure: Gene 1 through Gene 5, each with binding sites for TF marked]

The motif finding problem
Now suppose we are given the promoter regions of the five genes G1, G2, …, G5
Can we find the binding sites of TF, without knowing about them a priori?
  – Binding sites are similar to each other, but not necessarily identical
This is the motif finding problem
Goal: to find a motif that represents binding sites of an unknown TF

A variant of motif finding
Given a motif (e.g., consensus string, or weight matrix), find the binding sites in an input sequence
For a consensus string, the problem is trivial
  – For each position l in the input sequence, check if the substring starting at position l matches the motif.
For a weight matrix, not so trivial
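The consensus-string case can be sketched as a direct scan. The degenerate symbols are the ones listed on the earlier “Motif representation” slide. The function names and the test sequence are made up for illustration.

```python
# Degenerate symbols from the slides, plus the four plain bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "W": "AT", "S": "CG", "R": "AG", "Y": "CT",
}

def matches(motif, substring):
    """True if every base of the substring is allowed by the motif symbol at that position."""
    return len(motif) == len(substring) and all(
        base in IUPAC[sym] for sym, base in zip(motif, substring)
    )

def find_sites(motif, sequence):
    """Return the start positions (0-based) of all substrings matching the motif."""
    l = len(motif)
    return [i for i in range(len(sequence) - l + 1)
            if matches(motif, sequence[i:i + l])]

# Hypothetical input sequence containing two sites for the motif WAATCCN.
print(find_sites("WAATCCN", "GGTAATCCCAGAAATCCGTT"))  # [2, 11]
```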

Binding sites from a weight matrix motif
[Figure: two matrices with rows A, C, G, T: the counts of each base in each column, and W, the probability of each base in each column]
$W_{\alpha k}$ = probability of base $\alpha$ in column $k$
Given a string $s = s_1 s_2 \ldots s_l$ of length $l$ (the number of columns of $W$):
$\Pr(s \mid W) = \prod_{k=1}^{l} W_{s_k k}$
Example (here $l = 8$): Pr(CTAATCCG) = 0.67 × 0.89 × 1 × 1 × 0.89 × 1 × 0.89 × 0.11
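A minimal sketch of both steps on this slide: turning aligned binding sites into per-column probabilities, then multiplying one entry per column to get Pr(s | W). The four sites below are hypothetical, not the slide's data, and no pseudocounts are used, so a base never seen in a column gets probability 0.

```python
from collections import Counter

def weight_matrix(sites):
    """W[k][base] = fraction of the aligned sites that have `base` in column k."""
    n = len(sites)
    W = []
    for column in zip(*sites):          # iterate over alignment columns
        counts = Counter(column)        # counts of each base in this column
        W.append({b: counts[b] / n for b in "ACGT"})
    return W

def prob(s, W):
    """Pr(s | W): product over columns k of W[k][s_k]."""
    p = 1.0
    for k, base in enumerate(s):
        p *= W[k][base]
    return p

# Hypothetical aligned binding sites (illustration only).
sites = ["TAATCCC", "TAATCCG", "AAATCCC", "TAATCCA"]
W = weight_matrix(sites)
print(prob("TAATCCC", W))  # 0.75 * 0.5 = 0.375
print(prob("GAATCCC", W))  # 0.0, since G never occurs in column 1
```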

Binding sites from a weight matrix motif
Given sequence S (e.g., 1000 base-pairs long)
For each substring s of S (of length l),
  – Compute Pr(s|W)
  – If Pr(s|W) > some threshold, call that a binding site
Look at S, as well as its “reverse complement”
  – Rev. compl. of AGTTACACCA is TGGTGTAACT
  – (That’s what is on the other strand of DNA)
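The scan described above can be sketched as follows. The toy matrix `W_TOY`, the threshold, and the input sequence are assumptions chosen for illustration. Positions reported for the “-” strand are indices into the reverse-complemented sequence.

```python
def reverse_complement(seq):
    """Return the sequence found on the opposite DNA strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def site_prob(s, W):
    """Pr(s | W): product over columns k of W[k][s_k]."""
    p = 1.0
    for k, base in enumerate(s):
        p *= W[k][base]
    return p

def scan(S, W, threshold):
    """Report (strand, start, substring, probability) for every window above threshold."""
    l = len(W)
    hits = []
    for strand, seq in (("+", S), ("-", reverse_complement(S))):
        for i in range(len(seq) - l + 1):
            s = seq[i:i + l]
            p = site_prob(s, W)
            if p > threshold:
                hits.append((strand, i, s, p))
    return hits

# Hypothetical 3-column weight matrix favouring the string ACG.
W_TOY = [
    {"A": 0.9, "C": 0.0, "G": 0.0, "T": 0.1},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
print(reverse_complement("AGTTACACCA"))       # TGGTGTAACT
print(scan("TTACGTT", W_TOY, threshold=0.5))  # one hit on each strand
```

In practice the threshold is often applied to a log-odds score against background frequencies rather than to the raw probability; the raw form is kept here to match the slide.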

Ab initio motif finding
The original motif finding problem
To find a motif that represents binding sites of an unknown TF

Ab initio motif finding
Define a motif score, then find the motif with maximum score over all possible motifs in the search space (motif model)
Consensus string model => exhaustive search algorithm, guaranteed to find the optimal motif
PWM model => local search, not guaranteed to find the optimal motif

Ab initio motif finding - consensus string motifs
Define motif model precisely
A precise motif model defines the search space (i.e., a list of all candidate motifs).
The motif model also prescribes exactly how to determine if a substring is a match to a particular motif.

Ab initio motif finding - consensus string motifs
E.g., string over alphabet {A,C,G,T} of fixed length l. If l = 4, all 256 strings AAAA, AAAT, AAAC, …, TTTT are “candidate motifs”.
E.g., string over alphabet {A,C,G,T} of fixed length l, allowing up to d mismatches. If AAAA is a motif, and d = 1, then AAAT, AATA, etc. are also counted as matches to the motif.
E.g., string over extended alphabet {A,C,G,T,N} of fixed length l. Here “N” stands for any character (A, C, G, or T).
  – If AANAA is the motif, then AACAA, AAGAA, AATAA, and AAAAA are all counted as matches to this motif.
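The three motif models above can be illustrated with one small sketch: enumerate the candidate strings, and decide matches by counting mismatches while letting “N” match anything. The function names are illustrative.

```python
from itertools import product

def candidate_motifs(l, alphabet="ACGT"):
    """All strings of length l over the alphabet: the search space of the motif model."""
    return ["".join(p) for p in product(alphabet, repeat=l)]

def mismatches(motif, substring):
    """Number of positions where the substring disagrees with the motif ('N' never disagrees)."""
    return sum(m != "N" and m != c for m, c in zip(motif, substring))

def is_match(motif, substring, d=0):
    """True if the substring matches the motif with at most d mismatches."""
    return len(motif) == len(substring) and mismatches(motif, substring) <= d

print(len(candidate_motifs(4)))        # 256
print(is_match("AAAA", "AAAT", d=1))   # True
print(is_match("AANAA", "AACAA"))      # True
```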

Ab initio motif finding - consensus string motifs
Define a motif score, i.e., a real number associated with each candidate motif, in relation to the input sequences.
E.g., the count $N_s$ of a motif $s$ in the input sequence(s).
E.g., some function of the motif count $N_s$.
  – E.g., $Z_s = (N_s - E_s)/\sigma_s$
  – $E_s$ is the expected count of motif $s$ in random sequences; and
  – $\sigma_s$ is the standard deviation of the count in random sequences
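One way to compute such a z-score is sketched below under simplifying assumptions that are not from the slides. Random sequences are modelled as independent bases with background frequencies `q`, and the count is treated as binomial over all windows. This ignores correlations between overlapping occurrences, so the standard deviation is only approximate.

```python
import math

def count_occurrences(motif, sequences):
    """N_s: number of (possibly overlapping) exact occurrences of the motif."""
    l = len(motif)
    return sum(seq[i:i + l] == motif
               for seq in sequences
               for i in range(len(seq) - l + 1))

def z_score(motif, sequences, q):
    """Z_s = (N_s - E_s) / sigma_s under an i.i.d. background (binomial approximation)."""
    l = len(motif)
    p = math.prod(q[b] for b in motif)                       # chance a window equals the motif
    m = sum(max(len(seq) - l + 1, 0) for seq in sequences)   # number of windows
    expected = m * p
    sigma = math.sqrt(m * p * (1 - p))
    return (count_occurrences(motif, sequences) - expected) / sigma

# Hypothetical input sequences and uniform background frequencies.
seqs = ["ACGTACGT", "TTACGTTT"]
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(count_occurrences("ACG", seqs))  # 3
print(round(z_score("ACG", seqs, q), 2))
```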

Ab initio motif finding - consensus string motifs
For each motif s in the search space,
  – Compute the score of s
Output the highest scoring motifs.
This is the “enumerative” algorithm.
Guaranteed to produce the optimal motif, since every possible motif is considered.
Guarantee possible due to the small search space (e.g., $4^l$ candidates, where l is the motif length).
Can’t handle large values of l (e.g., l > 10): running time grows exponentially with l.
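A minimal sketch of the enumerative algorithm, using the plain occurrence count as the motif score (any other score, such as a z-score, could be substituted). The input sequences are hypothetical.

```python
from itertools import product

def count_occurrences(motif, sequences):
    """Score used here: number of exact occurrences of the motif in the input."""
    l = len(motif)
    return sum(seq[i:i + l] == motif
               for seq in sequences
               for i in range(len(seq) - l + 1))

def best_motifs(sequences, l, top=3):
    """Score every one of the 4**l candidate strings and return the top scorers."""
    scored = []
    for letters in product("ACGT", repeat=l):   # the whole search space
        motif = "".join(letters)
        scored.append((count_occurrences(motif, sequences), motif))
    # Highest score first; ties broken alphabetically for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top]

print(best_motifs(["ACGTACGT", "TTACGTTT"], l=3))
# [(3, 'ACG'), (3, 'CGT'), (2, 'TAC')]
```

The loop runs $4^l$ times, which is the exponential cost the slide refers to.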

Ab initio motif finding - PWM motifs
Local search techniques, e.g.,
  – Gibbs sampling
  – Expectation Maximization
  – Greedy

Gibbs sampling: The search space
Input: a set of sequences $\{S_1, S_2, \ldots, S_n\}$
Input: motif length $l$
Candidate motif: a set of substrings $\{s_1, s_2, \ldots, s_n\}$, each of length $l$, one from each $S_i$.
Search space: all possible candidate motifs
  – $O(L^n)$, where $L$ is the length of each $S_i$ (each sequence offers $L - l + 1$ substrings, so there are $(L - l + 1)^n$ candidates).

Gibbs sampling: algorithm
Consider any candidate motif $\{s_1, s_2, \ldots, s_n\}$, where each $s_i$ is of length $l$
Let $W_{\alpha k}$ be the frequency of base $\alpha$ in the $k$th position of the candidate motif
  – $\Pr(s \mid W) = \prod_{k=1}^{l} W_{s_k k}$
Let the “background” (genome-wide) frequency of nucleotide $\alpha$ be $q_\alpha$

Gibbs sampling: algorithm
Let the current motif be $M_t = \{s_1, s_2, \ldots, s_n\}$
Pick one $s_i$ to replace
For each substring $s'$ of length $l$ in $S_i$, replace $s_i$ with $s'$ and compute $\Pr(s')$

Gibbs sampling: algorithm
Pick $s'$ with probability proportional to $\Pr(s')$ as computed
Replace $s_i$ with $s'$ to obtain the new current motif $M_{t+1}$
Keep updating the motif
Report the motif with maximum score
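The update loop can be sketched as below. The transcript does not preserve the exact quantity computed for each s', so this sketch assumes a common formulation: build the weight matrix from the other n - 1 sites (with pseudocounts), and weight each s' by Pr(s' | W) / Pr(s' | background). The motif score, the pseudocount, the iteration count, and the example sequences are likewise assumptions for illustration.

```python
import math
import random

def profile(sites, l, pseudo=1.0):
    """W[k][base]: frequency of base in column k; pseudocounts avoid zero entries."""
    W = []
    for k in range(l):
        counts = {b: pseudo for b in "ACGT"}
        for s in sites:
            counts[s[k]] += 1
        total = sum(counts.values())
        W.append({b: c / total for b, c in counts.items()})
    return W

def log_ratio(s, W, q):
    """log of Pr(s | W) / Pr(s | background q)."""
    return sum(math.log(W[k][b] / q[b]) for k, b in enumerate(s))

def gibbs(seqs, l, q, iters=1000, seed=0):
    rng = random.Random(seed)
    pos = [rng.randrange(len(S) - l + 1) for S in seqs]  # random initial motif
    best_score, best_sites = float("-inf"), None
    for _ in range(iters):
        i = rng.randrange(len(seqs))                     # pick one s_i to replace
        others = [S[p:p + l] for j, (S, p) in enumerate(zip(seqs, pos)) if j != i]
        W = profile(others, l)
        starts = list(range(len(seqs[i]) - l + 1))
        weights = [math.exp(log_ratio(seqs[i][p:p + l], W, q)) for p in starts]
        pos[i] = rng.choices(starts, weights=weights)[0]  # sample s' proportionally
        sites = [S[p:p + l] for S, p in zip(seqs, pos)]
        score = sum(log_ratio(s, profile(sites, l), q) for s in sites)
        if score > best_score:                            # remember the best motif seen
            best_score, best_sites = score, sites
    return best_score, best_sites

# Hypothetical input: four short sequences that each contain TAATCCC.
seqs = ["GCTTAATCCCGA", "TAATCCCTTGCA", "CAGTTAATCCCT", "ATGCTAATCCCG"]
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(gibbs(seqs, l=7, q=q))
```

Because the sampler is randomized, different seeds can return different motifs. Running several restarts and keeping the best score is a usual precaution.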