2 Reference Papers: Xiaohui Xie, Jun Lu, E. J. Kulbokas, Todd R. Golub, Vamsi Mootha, Kerstin Lindblad-Toh, Eric S. Lander, Manolis Kellis, “Systematic discovery of regulatory motifs in human promoters and 3’ UTRs by comparison of several mammals”, Nature, 2005. Mathieu Blanchette and Martin Tompa, “Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting”, Genome Res.
3 What is a Motif? A motif is a nucleotide sequence pattern that has biological significance. Regulatory motifs are DNA fragments.
4 Motif Logos: The height of each letter represents the probability of finding that nucleotide at that position in the motif.
5 Why is it difficult to find them? 1. Short fragments. 2. Degenerate. 3. Unpredictable. Motifs can occur on either strand.
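To make the "degenerate" and "either strand" points concrete, here is a minimal sketch of a scanner that accepts IUPAC degenerate codes and reports hits on both strands. The function names and the test sequence are illustrative assumptions, not taken from either paper.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(seq, motif):
    """Return sorted (position, strand) hits; positions are forward-strand coordinates."""
    # The lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=(" + "".join(IUPAC[c] for c in motif) + "))")
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    n, k = len(seq), len(motif)
    # Scan the reverse complement and map each hit back to forward coordinates.
    for m in pattern.finditer(reverse_complement(seq)):
        hits.append((n - m.start() - k, "-"))
    return sorted(hits)


# Hypothetical test sequence containing the motif once on each strand.
print(find_motif("AATGACCTTGAACAAGGTCATT", "TGACCTYG"))
```

A palindromic motif is reported on both strands at the same position, which is one reason strand handling needs care when counting occurrences.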
6 Promoter: In genetics, a promoter is a DNA sequence that enables a gene to be transcribed. The promoter is recognized by RNA polymerase, which then initiates transcription.
7 3’ UTR: The three prime untranslated region (3' UTR) is a particular section of messenger RNA (mRNA). An mRNA codes for a protein through translation. The mRNA also contains regions that are not translated. In eukaryotes these are the 5' untranslated region, the 3' untranslated region, the cap, and the poly-A tail.
8 What the paper proposes. What? Discovering the regulatory motifs in human promoters and 3’ UTRs. How? By comparing sequence motifs of several mammals. That is why it is called comparative motif finding. Which mammals? Human, mouse, rat, and dog.
10 Methods. Total sequence by type: promoter, 68 Mb; 3’ UTR, 15 Mb; intron (control). Chose 17,700 well-annotated genes from the RefSeq database. Promoters = 4 kb centered at the transcriptional start site (noncoding sequence only). 3’ UTRs = based on the annotation of the reference mRNA. Intronic sequences serve as a control (last two introns from each gene).
11 Motif Conservation Score. A motif is said to be conserved when an exact match is found in all 4 species. Conservation = conserved occurrences / all occurrences. MCS = (observed conservation - random conservation) / standard deviation.
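The conservation rate defined on this slide can be sketched as a count over a multiple alignment. This is only an illustration: the toy alignment is invented, and it assumes gap-free aligned sequences of equal length with human as the reference species.

```python
def conservation_rate(motif, alignment, reference="human"):
    """Return (conserved, total, rate) for exact matches of `motif`.

    `alignment` maps species name -> aligned sequence. An occurrence in the
    reference counts as conserved only if every species has an exact match
    at the same aligned position.
    """
    ref = alignment[reference]
    k = len(motif)
    total = conserved = 0
    for i in range(len(ref) - k + 1):
        if ref[i:i + k] == motif:
            total += 1
            if all(seq[i:i + k] == motif for seq in alignment.values()):
                conserved += 1
    rate = conserved / total if total else 0.0
    return conserved, total, rate


# Invented toy alignment: the first occurrence is conserved, the second is not.
toy = {
    "human": "TGACCTTGAAATGACCTTG",
    "mouse": "TGACCTTGAAATGACCTAG",
    "rat":   "TGACCTTGCAATGACCTTG",
    "dog":   "TGACCTTGAAATGACTTTG",
}
print(conservation_rate("TGACCTTG", toy))  # (1, 2, 0.5)
```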
12 Known highly conserved motif: Err α [TGACCTTG]. Of the 434 times the Err α motif occurs in human promoter regions, 162 are conserved across all 4 species. Conservation rate = 37%. A random 8-mer motif shows only a 6.8% conservation rate.
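As a worked example, the Err α numbers above can be turned into a score. The slides do not say how the standard deviation is obtained; the sketch below assumes a binomial model in which each of the occurrences is conserved independently with the random rate. That is an assumption for illustration, so the result need not match the published value exactly.

```python
import math


def motif_conservation_score(conserved, total, random_rate):
    """Z-score of the conserved count under an assumed binomial null model."""
    expected = total * random_rate
    # Binomial standard deviation (assumption; not stated in the slides).
    sd = math.sqrt(total * random_rate * (1.0 - random_rate))
    return (conserved - expected) / sd


rate = 162 / 434                      # observed conservation rate for Err α
mcs = motif_conservation_score(162, 434, 0.068)
print(f"conservation rate = {rate:.1%}, MCS = {mcs:.1f}")
```

Under this assumption the score comes out at roughly 25 standard deviations, far above the MCS > 6 cutoff used for the results that follow.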
14 Results: Promoter Region. 174 highly conserved motifs (MCS > 6). 59 are strong matches to known motifs and 10 are weaker matches. 105 are potential new regulatory motifs.
15 Approaches to explore biological significance. So why is the motif biologically significant? 1. Tissue specificity. 2. Positional bias.
16 Tissue Specificity. Tissue specificity of expression for genes containing the discovered motifs. Expression data for 75 tissues. 59 of the 69 known motifs and 53 of the 105 unknown motifs show tissue specificity.
17 Position Bias. Conserved motifs show a strong position bias: they occur preferentially within 100 bases of the TSS.
18 Results: motifs in 3’ UTRs. 106 conserved motifs found in 3’ UTRs (MCS > 6). 3’ UTR motifs have not been studied before, so comparison of the discovered motifs to a large collection of previously known motifs is not possible. Two unique properties: strand specificity and a bias towards 8-mers.
19 Property 1: strand specificity. Xie, X. et al., Nature, 2005
20 Property 2: bias towards 8-mers. Xie, X. et al., Nature, 2005
21 Digression: miRNA. A miRNA is a single-stranded RNA transcribed from DNA but not translated into protein. Many mature miRNAs start with a U followed by a 7-base “seed” complementary to a site in the 3’ UTR of target mRNAs. Thus many of the motifs are 8-mers. Example: a microRNA that regulates insulin secretion, reported in an NYU study published in Nature.
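The link between a miRNA and an 8-mer site can be sketched as follows. The expected 3’ UTR site is taken as the reverse complement of the first eight bases of the mature miRNA (the leading U plus the 7-base seed), written in the DNA alphabet. The let-7a sequence is used purely as an illustration and does not come from the slides.

```python
# RNA base -> complementary DNA base, so the site reads like genomic sequence.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}


def target_8mer(mature_mirna):
    """Return the 8-mer site complementary to bases 1-8 of a mature miRNA."""
    first8 = mature_mirna[:8].upper()
    return "".join(RNA_TO_DNA_COMPLEMENT[b] for b in reversed(first8))


let7a = "UGAGGUAGUAGGUUGUAUAGUU"   # mature let-7a, illustrative input
print(target_8mer(let7a))           # CTACCTCA
```

Because the miRNA starts with U, the matching site ends with A, which is consistent with the strand-specific 8-mers described on the preceding slides.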
22 Inference. Thus we can infer that many of the conserved 8-mer motifs act as binding sites for miRNAs. This leads to the rediscovery of 52% of known miRNA genes and to the discovery of 129 new miRNA genes.
24 Problem Definition (why?). A major challenge of current genomics is to understand how gene expression is regulated. An important step towards this understanding is the capability to identify regulatory elements.
25 What? Phylogenetic footprinting is (1) a method for the discovery of regulatory elements (2) in orthologous regulatory regions (3) from multiple species.
27 Main idea. Because of selective pressure, coding sequences evolve at a slower rate than non-coding sequences. A mutation in a coding sequence can alter the whole function of the encoded protein. A mutation in a non-coding sequence (RE) may only change the expression frequency of a gene.
28 Phylogenetic Footprinting. Study orthologous non-coding DNA from species that are related (phylogenetic tree). Differentiation: tree. Find one motif in many species. Well conserved = possible regulatory element.
29 Formalization. Given: a phylogenetic tree T, a set of orthologous sequences at the leaves of T, the length k of the motif, and a threshold d. Problem: find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.
30 Small Example. Size of motif sought: k = 4. AGTCGTACGTGAC... (Human), AGTAGACGTGCCG... (Chimp), ACGTGAGATACGT... (Rabbit), GAACGGAGTACGT... (Mouse), TCGTGACGGTGAT... (Rat).
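The search over this small example can be sketched with a dynamic program over the tree: for every node and every possible k-mer, keep the best parsimony score of the subtree when that node is labeled with that k-mer. This is a sketch in the spirit of phylogenetic footprinting and not necessarily the authors' exact implementation. The tree topology is an assumption (the slides shown here do not give it), and only the sequence prefixes printed on the slide are used.

```python
from itertools import product
import math


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_scores(tree, seqs, k, all_kmers):
    """Map each k-mer s to the best subtree score when the root is labeled s."""
    if isinstance(tree, str):
        present = {seqs[tree][i:i + k] for i in range(len(seqs[tree]) - k + 1)}
        # A leaf may only be labeled with a k-mer that occurs in its sequence.
        return {s: (0 if s in present else math.inf) for s in all_kmers}
    total = {s: 0 for s in all_kmers}
    for child in tree:
        w = best_scores(child, seqs, k, all_kmers)
        finite = [(t, c) for t, c in w.items() if c != math.inf]
        for s in all_kmers:
            # Best child label t: child subtree cost plus substitutions s -> t.
            total[s] += min(c + hamming(s, t) for t, c in finite)
    return total


seqs = {
    "Human": "AGTCGTACGTGAC", "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT", "Mouse": "GAACGGAGTACGT",
    "Rat": "TCGTGACGGTGAT",
}
tree = (("Human", "Chimp"), ("Rabbit", ("Mouse", "Rat")))  # assumed topology
k = 4
all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
root = best_scores(tree, seqs, k, all_kmers)
best = min(root.values())
print(best, [s for s, c in root.items() if c == best])
```

No 4-mer occurs in all five sequences, so the best achievable score is 1: ACGT appears in human, chimp, rabbit, and mouse, while rat has ACGG.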
36 Analysis: Metallothionein Gene Family. Large number of promoter sequences. Large number of REs. Binding sites occur within 300 bp of the start codon. 590 bp of sequence located upstream of the start codon. Conserved elements of lengths 7, 8, 9, 10 (k values). Identified 12 motifs, of which 4 have been confirmed.
38 Analysis: Insulin Gene Family. Two rodents and a pig (two gene copies each). Motifs with 0 mutations, k = 8; motifs with 1 mutation, k = 9, 10. 4 conserved motifs identified. Several binding sites missed as they contain very few mutations.
39 Analysis: C-myc Promoter. 7 species analyzed. Contains members from diverse animal groups (fishes, birds, mammals, batrachians). 4 of 9 predictions are known binding sites. Most are located in the 120 bp promoter region.
40 Drawbacks. Some binding sites do not have significant matches in most other species. Some binding sites are well conserved only over sequences shorter than those FootPrinter looked at. T3R
41 Drawbacks cont’d. Deletions/insertions. Failure to meet statistical significance. Some TFs bind as dimers, where the binding site may consist of 2 conserved regions separated by a few variable nucleotides.