Lecture 5.1 5.1: Gene Regulation and Promoter Analysis Wyeth Wasserman Centre for Molecular Medicine and Therapeutics Children’s and Women’s Hospital Department.

Lecture 5.1 5.1: Gene Regulation and Promoter Analysis Wyeth Wasserman Centre for Molecular Medicine and Therapeutics Children’s and Women’s Hospital Department of Medical Genetics University of British Columbia www.cisreg.ca

Lecture 5.1 Overview 5.1.0 Bioinformatics for detection of transcription factor binding sites The Specificity Problem 5.1.1 Discrimination of regulatory control sequences Based on knowledge of established TFBS 5.1.2 Discovery of regulatory mechanisms Based on de novo pattern discovery 5.1.3 Impending advances

Lecture 5.1 Layers of Complexity in Metazoan Transcription

Lecture 5.1 Transcription Simplified TATAURE URFPol-II

Lecture 5.1 5.1.0 Profile Models for Prediction of TF Binding Sites

Lecture 5.1 Representing Binding Sites for a TF A set of sites represented as a consensus VDRTWRWWSHD (IUPAC degenerate DNA) A 14 16 4 0 1 19 20 1 4 13 4 4 13 12 3 C 3 0 0 0 0 0 0 0 7 3 1 0 3 1 12 G 4 3 17 0 0 2 0 0 9 1 3 0 5 2 2 T 0 2 0 21 20 0 1 20 1 4 13 17 0 6 4 A matrix describing a a set of sites A single site AAGTTAATGA Set of binding sites AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA Set of binding sites AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA

Lecture 5.1 TGCTG = 0.9 PFMs to PWMs One would like to add the following features to the model: 1. Correcting for the base frequencies in DNA 2. Weighting for the confidence (depth) in the pattern 3. Convert to log-scale probability for easy arithmetic A 5 0 1 0 0 C 0 2 2 4 0 G 0 3 1 0 4 T 0 0 1 1 1 A 1.6 -1.7 -0.2 -1.7 -1.7 C -1.7 0.5 0.5 1.3 -1.7 G -1.7 1.0 -0.2 -1.7 1.3 T -1.7 -1.7 -0.2 -0.2 -0.2 f matrix w matrix Log ( ) f(b,i) + s(N) p(b)

Lecture 5.1 JASPAR OPEN-ACCESS DATABASE OF TF BINDING PROFILES (Some other databases with TF profiles include Transfac, TRRD, mPromDB, SCPD (yeast), dbTBS and EcoTFS (bacteria))

Lecture 5.1 Performance of Profiles 95% of predicted sites bound in vitro (Tronche 1997) MyoD binding sites predicted about once every 600 bp (Fickett 1995) Futility Theorem Nearly 100% of predicted TFBS have no function in vivo Brazma claims it should be called the futility conjunction

Lecture 5.1 1000bp promoter screened with collection of TF profiles (beta-globin)

Lecture 5.1 5.1.1 Pattern Discrimination Overcoming the specificity problem by incorporating biological knowledge into computational algorithms

Lecture 5.1 Phylogenetic Footprinting 70,000,000 years of evolution reveals most regulatory regions.

Lecture 5.1 SIDENOTE: Global Progressive Alignments (e.g. ORCA, AVID, LAGAN) Global alignments memory = product of sequence lengths Progressive alignment by banding with local alignments (e.g. BLAST) and running global method on banded sub-segments Recursion with decreasingly stringent parameters ORCA

Lecture 5.1 Phylogenetic Footprinting to Identify Functional Segments % Identity Actin gene compared between human and mouse. 200 bp Window Start Position (human sequence)

Lecture 5.1 Phylogenetic Footprinting (cont) FoxC2 100% 80% 60% 40% 20% 0% % Identity Start Position of 200bp Window

Recall...

Lecture 5.1 1000bp beta-globin promoter screened with phylogenetic footprinting

Lecture 5.1 Choosing the ”right” species... Genes evolve at different rates – make gene-specific choice COW MOUSE CHICKEN HUMAN

Lecture 5.1 Performance: Human vs. Mouse Pairwise Testing set: 40 experimentally defined sites in 15 well studied genes (Replicated with 100+ site set) 85-95% of defined sites detected with conservation filter, while only 11-16% of total predictions retained SELECTIVITYSENSITIVITY

ConSite Now driven by the ORCA Aligner

Lecture 5.1 Selected Emerging Issues Multiple sequence comparisons –Incorporate phylogenetic distances into a scoring metric –Visualization (see dcode service and Sockeye) –Analysis of many closely related species “Phylogenetic shadowing” Genome rearrangements –Inversion compatible alignment algorithm LAGAN Higher order models of TFBS

Lecture 5.1 Regulatory Modules for better specificity TFs do NOT act in isolation

Lecture 5.1 Layers of Complexity in Metazoan Transcription

Lecture 5.1 Liver regulatory modules

Lecture 5.1 PSSMs for Liver TFs… HNF1 C/EBP HNF3 HNF4

Lecture 5.1 Detection of Clusters of TFBS In the best cases, we have enough data to train a discriminant function Rare to have sufficient data Alternatively, identify dense clusters of sites that are statistically significant Diverse methods have been introduced over the past few years…Berman; Markstein; Frith; Noble; Wagner;… Non-trivial to correct for non-random properties of DNA –Most difficulty comes from local direct repeats A primary challenge from the biological side is the selection of a meaningful grouping of TFs Multiple testing problems severe

Lecture 5.1 TFBS Clusters (MSCAN, MCAST, COMET, etc) MSCAN allows users to submit any set of TF profiles Calculates significance for each site based on local sequence characteristics G-rich PSSM gets less weight on G-rich region of gene Calculates cluster significance using a dynamic programming approach Approximately 1 significant liver cluster / 18 000 bp in human genome sequence Filters to remove “significant” clusters of sites that contain local repeats Identification of non-random characteristics in DNA http://mscan.cgb.ki.se

Lecture 5.1 Training predictive models for modules Not every combination of sites is meaningful Reality: Some factors critical, others secondary An alternative is to teach the computer which combinations are better Limited by small size of positive training set Explore an older method based on Logistic Regression Analysis

Lecture 5.1 Recall: Liver regulatory modules

Logistic Regression Analysis          “logit” Optimize  vector to maximize the distance between output values for positive and negative training data. Output value is: e logit p(x)= 1 + e logit

Lecture 5.1 PERFORMANCE Liver (Genome Research, 2001) –At 1 hit per 35 kbp, identifies 60% of modules –Limited to genes expressed late in liver development LRA Models do not account for multiple sites for the same TF* *Frith et al’s CISTER algorithm circumvent this problem

Lecture 5.1 UDPGT1 (Gilbert’s Syndrome) Wildtype Mutant Liver Module Model Score “Window” Position in Sequence

Lecture 5.1 Making better predictions Profiles make far too many false predictions to have predictive value in isolation Phylogenetic footprinting eliminates about 90% of false predictions while retaining ~70-70% of real sites (human vs mouse) Detection of clusters of binding sites offers better predictive performance, especially through trained discrimination functions

Lecture 5.1 Active Issues Significance of clusters of sites Segmentation of DNA into regions of different composition Methods using training to find clusters Where to place weights? Interaction weighting in the absence of large data collections Resources Limited number of solid PSSMs Need a reference database for functional regulatory regions Validation of predictions for tissues/cells not well represented in cell culture

Lecture 5.1 EMERGING APPLICATION Regulatory Analysis of Variation in ENhancers Genetic variation in TFBS can result in biomedically important phenotypes

Lecture 5.1 Sequence Variation in TFBS TSSAaGT URF GENEDISEASE/CONDITION (associated)REFERENCE UGT1A1Gilbert’s Syndrome –jaundicePJ Bosma, et al., 1995 UCP3Elevated Body MassS Otabe et al., 2000 TNFalphaMalaria SusceptibilityJC Knight et al., 1999 ResistinElevated Body MassJC Engert et al., 2002 IL4RalphaReduced soluble IL4RH Hackstein et al., 2001 ABCA1Coronary artery diseaseKY Zwarts et al., 2002 ObLeptin levelsJ Hager et al., 1998 PEPCKObesityY. Olswang et al., 2002 PREndometrial cancerI DeVivo et al., 2002 LDLRFamilial hypercholesterolemiaKoivisto et al., 1994

Lecture 5.1 Stage 1: Prediction of Regulatory Regions

Lecture 5.1 Stage 1: Predict Regulatory Regions FoxC2 100% 80% 60% 40% 20% 0% Retrieve orthologous human and mouse gene sequences Align sequences with a global aligner (ORCA) Identify regions of conservation Designs primers for SNP discovery

Lecture 5.1 SIDENOTE: Data/Orthology obtained from GeneLynx (www.genelynx.org)

Lecture 5.1 Stage 2: Analysis of Polymorphisms ACGCATAAGTTAATGAATAACAGAT ACGCATAAGTTAACGAATAACAGAT

Lecture 5.1 Identify variations that generate allele- specific binding sites (predicted) 1234567890123456789012345 ACGCATAAGTTAAtGAATAACAGAT.............c........... Differences in scores Pseudo-data for instructional purposes

Lecture 5.1 RAVEN screenshots

Lecture 5.1 Finding characteristics over-represented in a set of co-regulated genes 5.1.2 Discovery of Mediating TFBS for Sets of Co-Regulated Genes

Pattern Discovery

Lecture 5.1 Linking co-expressed genes from microarrays to candidate TFs

Lecture 5.1 oPOSSUM Project A significant subset of TFs are represented by existing binding profiles Within same structural class, often binding specificity retained (more on this later) Can we link known TFs to a putative regulon by over-representation of predicted binding sites in promoters? Identical concept to the detection of over-represented GO terms from previous session

oPOSSUM Procedure Set of co- expressed genes Automated sequence retrieval from EnsEMBL Phylogenetic Footprinting Detection of transcription factor binding sites Statistical significance of binding sites Putative mediating transcription factors ORCA

Lecture 5.1 Reference Gene Sets Fisher p-values ++ p<1e-05, + p<1e-02

Lecture 5.1 MICROARRAY APPLICATION: NF-kB Inhibitor-sensitive genes (326) Set of co- expressed genes 1. Automated sequence retrieval from EnsEMBL 2. Phylogenetic Footprinting 3. Detection of known transcription factor binding sites 4. Statistical significance of binding sites ORCA Z score Fisher Putative transcription factors Set of co- expressed genes 1. Automated sequence retrieval from EnsEMBL 2. Phylogenetic Footprinting 3. Detection of known transcription factor binding sites 4. Statistical significance of binding sites ORCA Z score Fisher Putative transcription factors Set of co- expressed genes 1. Automated sequence retrieval from EnsEMBL 2. Phylogenetic Footprinting 3. Detection of known transcription factor binding sites 4. Statistical significance of binding sites ORCA Z score Fisher Putative transcription factors Set of co- expressed genes 1. Automated sequence retrieval from EnsEMBL 2. Phylogenetic Footprinting 3. Detection of known transcription factor binding sites 4. Statistical significance of binding sites ORCA Z score Fisher Putative transcription factors Set of co- expressed genes 1. Automated sequence retrieval from EnsEMBL 2. Phylogenetic Footprinting 3. Detection of known transcription factor binding sites 4. Statistical significance of binding sites ORCA Z score Fisher Putative transcription factors Set of co- expressed genes 1. Automated sequence retrieval from EnsEMBL 2. Phylogenetic Footprinting 3. Detection of known transcription factor binding sites 4. Statistical significance of binding sites ORCA Z score Fisher Putative transcription factors Set of co- expressed genes 1. Automated sequence retrieval from EnsEMBL 2. Phylogenetic Footprinting 3. Detection of known transcription factor binding sites 4. Statistical significance of binding sites ORCA Z score Fisher Putative transcription factors Set of co- expressed genes 1. Automated sequence retrieval from EnsEMBL 2. Phylogenetic Footprinting 3. Detection of known transcription factor binding sites 4. Statistical significance of binding sites ORCA Z score Fisher Putative transcription factors ++HMGSRY +++bHLH-ZIPMax ++++ForkheadFREAC-4 ++++ ForkheadHFH-2 ++++ ETSSPI-B ++++ HMGSox-5 +++++ HomeoPbx +++++ Rel/NFkBp50 +++++ Rel/NFkBc-Rel ++++++ Rel/NFkBp65 ++++++ Rel/NFkBNF-kappaB Fisher p- value z-score p- value Class Genes Significantly Down-regulated After Treatment with Inhibitor ++++ p<1e-30, +++ p<1e-10, ++ p<1e-05, + p<1e-02

oPOSSUM Server

Lecture 5.1 Over-represented Site Combinations (Kreiman 2004) Based on our understanding of CRMs, likely that combinations of sites would be more distinguished than individual sites (better signal-to-noise?) Kreiman has introduced a system to assess clusters of neighbouring conserved sites based on counting –Hypergeometric distribution, simply compare the frequency of the cluster occurrence vs. expectation

Lecture 5.1 What if the TFBS is novel?

Lecture 5.1 de novo Pattern Discovery Methods String-based –e.g. “Moby Dick” (Bussemaker, Li & Siggia) –Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections Profile-based –Monte Carlo/Gibbs Sampling –e.g. AnnSpec (Workman & Stormo) –Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics

Lecture 5.1 String-base Exhaustive Methods Word-based methods: How likely are X words in a set of sequences, given sequence characteristics? CCCGCCGGAATGAAATCTGATTGACATTTTCC>EP71002 (+) Ce[IV] msp-56 B; range -100 to -75 TTCAAATTTTAACGCCGGAATAATCTCCTATT >EP63009 (+) Ce Cuticle Col-12; range -100 to -75 TCGCTGTAACCGGAATATTTAGTCAGTTTTTG >EP63010 (+) Ce Cuticle Col-13; range -100 to -75 TATCGTCATTCTCCGCCTCTTTTCTT >EP11013 (+) Ce vitellogenin 2; range -100 to -75 GCTTATCAATGCGCCCGGAATAAAACGCTATA >EP11014 (+) Ce vitellogenin 5; range -100 to -75 CATTGACTTTATCGAATAAATCTGTT >EP11015 (-) Ce vitellogenin 4; range -100 to -75 ATCTATTTACAATGATAAAACTTCAA >EP11016 (+) Ce vitellogenin 6; range -100 to -75 ATGGTCTCTACCGGAAAGCTACTTTCAGAATT >EP11017 (+) Ce calmodulin cal-2; range -100 to -75 TTTCAAATCCGGAATTTCCACCCGGAATTACT >EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75 TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC >EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75 ACTGAACTTGTCTTCAAATTTCAACACCGGAA >EP17012 (+) Ce hsp 16K-1 A; range -100 to -75 TCAATGCCGGAATTCTGAATGTGAGTCGCCCT >EP55011 (-) Ce hsp 16K-1 B; range

Lecture 5.1 Exhaustive methods(2) Find all words of length 7 in the yeast genome Make a lookup table: TTTTTTTT/ aaaaaaa 57788 GATAGGCA/ tgcctatc 589 AAACCTTT/ aaaggttt 456 Etc... GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGG ACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGAC GGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGC CAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGT CTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAA ATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTG GGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAA GTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTT TGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGA AAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCA AAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATC ACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGT CGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAA GGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAG ACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAG ATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCAT CATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAA ACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTG AAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTG TCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTC TCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGT ATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCT AGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGT CTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCT GCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTG CTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA

Lecture 5.1 Over-representation How many words of type ’AGGAGTGA’ are found in our sequences? How likely is this result? Exhaustive methods(3)

Lecture 5.1 Modeling Properties of DNA Simple: How likely are single nucleotides? (extended Bernoulli) Complex: Neglect certain words Locations of TFBS Higher-order descriptions of DNA Exhaustive methods(4)

Lecture 5.1  Algorithms with high complexity - Large sequences and/or many possible word lengths not possible  Often string-based  TFBS are not words (’fuzzy’ binding)  Sensitivity susceptible to noisy indata (e.g. microarrays) Exhaustive methods: Key items

Lecture 5.1 Motivations: TFBS are not words Efficiency Can be intentionally influenced by biological data Find a local alignment of width x of sites that maximizes information content in reasonable time Usually by Gibbs sampling or EM methods Profile-based Methods (usually probablistic)

Two data structures used: 1) Current pattern nucleotide frequencies q i,1,..., q i,4 and corresponding background frequencies p i,1,..., p i,4 2) Current positions of site startpoints in the N sequences a 1,..., a N, i.e. the alignment that contributes to q i,j. One starting point in each sequence is chosen randomly initially. The Gibbs Sampling algorithm tgacttcc tgatctct agacctca tgacctct Profile Methods (2)

Iteration step Remove one sequence z from the set. Update the current pattern according to tgacttcc tgatctct agacctca tgacctct Pseudocount for symbol j Sum of all pseudocounts in column Profile Methods (3) A ’Score’ the current pattern against each possible occurence a k in z. Draw a new a k with probabilities based on respective score divided by the background model B z

Lecture 5.1 Pattern Discovery Across Orthologous Promoters from Gram-Positive Bacteria Real sets random

EXAMPLE: Yeast Regulatory Sequence Analysis (YRSA) system

Lecture 5.1 Tests of YRSA System PDR3-regulated genes from array study Classic cell-cycle array data re-clustered by Getz et al DNA-damage response partially mediating by MCB

SIDENOTE: Comparison of profiles requires alignment and a scoring function Scoring function based on sum of squared differences Align frequency matrices with modified Needleman-Wunsch algorithm Calculate empirical p-values based on simulated set of matrices Score Frequency

Lecture 5.1 How is the Performance: Hit and Miss

Lecture 5.1 Applied Pattern Discovery is Acutely Sensitive to Noise SEQUENCE LENGTH PATTERN SIMILARITY vs. TRUE MEF2 PROFILE True Mef2 Binding Sites

Lecture 5.1 Over-coming the sensitivity challenge Metazoan genomes are far from ideal

Lecture 5.1 Biochemical complexity enables greater complexity in regulation 500 bp Yeast ORF A GO Humans 20 000 bp EXON 1EXON 32 GO

Lecture 5.1 Four Approaches to Improve Sensitivity Better background models -Higher-order properties of DNA Phylogenetic Footprinting –Human:Mouse comparison eliminates ~75% of sequence Regulatory Modules –Architectural rules Limit the types of binding profiles allowed –TFBS patterns are NOT random

Lecture 5.1 Phylogenetic Footprinting to Identify Conserved Regions Bayes Block Aligner (Lawrence Group) ORCA

Lecture 5.1 Skeletal Muscle Genes One of the most extensively studied tissues for transcriptional regulation –45 genes partially analyzed –26 genes with orthologous genomic sequence from human and rodent Five primary classes of transcription factors –Principal: Myf (myoD), Mef2, SRF –Secondary: Sp1 (G/C rich patches), Tef (subset of skeletal muscle types)

Lecture 5.1 de novo Discovery of Skeletal Muscle Transcription Factor Binding Sites Mef2-LikeSRF-LikeMyf-Like

Lecture 5.1 Pattern discovery methods using biochemical constraints

Lecture 5.1 tgacttcc tgatctct agacctca tgacctct RECALL: Gibbs Algorithm tgacttcc tgatctct agacctca tgacctct z

Lecture 5.1 Some profile constraints have been explored… Segmentation of informative columns Palindromic patterns

Lecture 5.1 Intra-family PSSM similarity TF Database (JASPAR) COMPARE Match to bHLH Jackknife Test87% correct Independent Test Set93% correct

Lecture 5.1

FBPs enhance sensitivity of pattern detection

Lecture 5.1

APPLICATION: Cancer Protection Response Detoxification-related enzymes are induced by compounds present in Broccoli Arrays, SSH and hard work have defined a set of responsive genes A known element mediates the response (Antioxidant Responsive Element) Controversy over the type of mediating leucine zipper TF NF-E2/Maf or Jun/Fos

Gibbs Sampling Application (2) Problem: Given a set of co-regulated genes, determine the common TFBS. Classify the mediating TF. We expect a leucine zipper-type TF.

Application (3) Problem: Given a set of co-regulated genes, determine the common TFBS. Classify the mediating TF. We expect a leucine zipper-type TF. Gibbs with FBP Prior

Application (4) Problem: Given a set of co-regulated genes, determine the common TFBS. Classify the mediating TF. We expect a leucine zipper-type TF. Classify New TF Motif Maf (p<0.02) Jun (p<0.98)

Lecture 5.1 EMERGING METHOD de novo Analysis of Regulatory Modules

Focus on regulatory modules for pattern detection Cluster Genes by Expression Identify and Model Contributing TFs 6 0 0 0 7 0 0 2 8 4 7 1 0 2 0 0 4 0 0 8 0 0 0 0 1 0 0 6 Predictive Models

Width Distributions (Sum of Separations) Binding Profiles 250  General Circuit Properties ii = Number of Sites Distribution 3  Separation Distribution (Default = Uniform) 100bp  Analyze co-regulated genes to define circuit characteristics 0  ij Neighbor Interactions ii jj ii jj Specific Gene Features

Lecture 5.1 Discovery performance Approximately 50% of annotated TFBS are detected in the training set sequences of 25 genes Only 40% of predicted TFBS are annotated We suspect that most of the un-annotated sites will turn out to be functional. This needs to be determined.

Lecture 5.1 Review of Primary Points Second Chance

Lecture 5.1 Regulatory regions problem space Sets of binding sites AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA Sets of binding sites AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA Specificity profiles for binding sites A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ] Specificity profiles for binding sites A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ] C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ] G [0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ] T [0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ] Clusters of binding sites Transcription factors Transcription factor binding sites Regulatory nucleotide sequences Transcription factors Transcription factor binding sites Regulatory nucleotide sequences TATAURE URFPol-II

Lecture 5.1 Detecting binding sites in a single sequence Scanning a sequence against a PWM A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ] C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ] G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ] T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ] ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC Abs_score = 13.4 (sum of column scores) Sp1 Calculating the relative score A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ] 1.5128 C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ] 1.23481.2348 2.1222 2.12221.2348 1.5128 1.7457 1.7457 G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ] 1.7457 T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ] -1.5-1.5 -1.5 A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ] -1.5 C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ] -1.5 G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ] 0.4368 -0.2284-1.5 -1.5-1.5 T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ] Max_score = 15.2 (sum of highest column scores) Min_score = -10.3 (sum of lowest column scores) Scanning 1300 bp of human insulin receptor gene with Sp1 at rel_score threshold of 75% Ouch. Is 93% better than 82%?

Lecture 5.1 Low specificity of profiles: too many hits great majority are not biologically significant A dramatic improvement in the percentage of biologically significant detections Scanning a single sequenceScanning a pair orf orthologous sequences for conserved patterns in conserved sequence regions Phylogenetic Footprints

Applied Pattern Discovery is Acutely Sensitive to Noise SEQUENCE LENGTH PATTERN SIMILARITY vs. TRUE MEF2 PROFILE True Mef2 Binding Sites

Lecture 5.1 Acknowledgements Wasserman Group Wynand Alkema Dave Arenillas Jochen Brumm Alice Chou Shannan Ho Sui Danielle Kemmer Jonathan Lim Raf Podowski Dora Pak Albin Sandelin Chris Walsh Collaborating Trainees Malin Andersson (KTH) Öjvind Johansson (UCSD) Stuart Lithwick (U.Toronto) Support: CIHR, CGDN, CFI, Merck-Frosst, BC Children’s Hospital Foundation, Pharmacia, EC–Marie Curie, KI-Funder Collaborators Boris Lenhard (K.I.) Chip Lawrence (Wadsworth) William Thompson (Wadsworth) Jens Lagergren (KTH) Christer Höög (K.I.) Brenda Gallie (OCI) Jacob Odeberg (KTH) Niclas Jareborg (AZ) William Hayes (AZ) James Mortimer (MF) Group Alumni Elena Herzog Annette Höglund William Krivan Luis Mendoza

Lecture 5.1 EXTRA SLIDES What will a computational biologist do with a scoring function? Build a similarity tree!

Lecture 5.1 The matrix tree:

Lecture 5.1 Compare with consensus for both classes - CANNTG bHLH-Zip domain bHLH domain

Lecture 5.1 5.1: Gene Regulation and Promoter Analysis Wyeth Wasserman Centre for Molecular Medicine and Therapeutics Children’s and Women’s Hospital Department.

Similar presentations

Presentation on theme: "Lecture 5.1 5.1: Gene Regulation and Promoter Analysis Wyeth Wasserman Centre for Molecular Medicine and Therapeutics Children’s and Women’s Hospital Department."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Lecture 5.1 5.1: Gene Regulation and Promoter Analysis Wyeth Wasserman Centre for Molecular Medicine and Therapeutics Children’s and Women’s Hospital Department.

Similar presentations

Presentation on theme: "Lecture 5.1 5.1: Gene Regulation and Promoter Analysis Wyeth Wasserman Centre for Molecular Medicine and Therapeutics Children’s and Women’s Hospital Department."— Presentation transcript:

Similar presentations

About project

Feedback