Presentation is loading. Please wait.

Presentation is loading. Please wait.

Computational Molecular Biology Biochem 218 – BioMedical Informatics 231 Discovering Transcription.

Similar presentations


Presentation on theme: "Computational Molecular Biology Biochem 218 – BioMedical Informatics 231 Discovering Transcription."— Presentation transcript:

1 Computational Molecular Biology Biochem 218 – BioMedical Informatics Discovering Transcription Factor Binding Sites in Co-Regulated Genes Doug Brutlag Professor Emeritus Biochemistry & Medicine (by courtesy)

2 Motivation Searching for conserved sequence motifs regulating the expression MicroArray analysis of whole genome gene expression Clustering of genes based on their expression pattern

3 Megacluster of Yeast Gene Expression

4 T Cells Signaling DNA Damage Fibroblast Stimulation B Cells Signaling CMV Infection Anoxia Polio Infection Monocytes Signaling IL4 Hormone Human Gene Expression Signatures

5 Upstream Regions Co- expressed Genes GATGGCTGCACCACGTGTATGC...ACGATGTCTCGC CACATCGCATCACGTGACCAGT...GACATGGACGGC GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA TCTCGTTAGGACCATCACGTGA...ACAATGAGAGCG CGCTAGCCCACGTGGATCTTGA...AGAATGACTGGC Finding Transcription Factor Binding Sites Pho 5 Pho 8 Pho 81 Pho 84 Pho … Transcription Start

6 Upstream Regions Co-expressed Genes GATGGCTGCACCACGTGTATGC...ACGATGTCTCGC CACATCGCATCACGTGACCAGT...GACATGGACGGC GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA TCTCGTTAGGACCATCACGTGA...ACAATGAGAGCG CGCTAGCCCACGTGGATCTTGT...AGAATGGCCTAT Finding Transcription Factor Binding Sites

7 Upstream Regions Co-expressed Genes ATGGCTGCACCACGTTTATGC...ACGATGTCTCGC CACATCGCATCACGTGACCAGT...GACATGGACGGC GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA TTAGGACCATCACGTGA...ACAATGAGAGCG CGCTAGCCCACGTTGATCTTGT...AGAATGGCCTAT Pho4 binding Finding Transcription Factor Binding Sites

8 Three Algorithms BioProspector o Presented in 2000 o Extends Gibbs sampling (stochastic method) o For any cluster of sequences MDScan o Deterministic approach o Enumerative o Very fast o For sequences with some ranking information MotifCut and MotifScan o Graph-based o Does not use PSSMs o Novel and sensitive

9 Representing Ambiguous DNA Motifs Sequence Patterns (Regular expressions) IUPAC nomenclatures for DNA ambiguities Consensus motif: CACAAAA Degenerate motif: CRCAAAW A/T A/G

10 Weight Matrix for Transcription Factor Binding Sites A DNA Motif as a position specific frequency weight matrix Sites ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG Alignment Matrix Frequency weight Matrix

11 Weight Matrix with Consensus Sequence & Logotype with Degenerate Consensus TTWHYCGGHY Weight Matrix or Position Specific Scoring Matrix

12 BioProspector Initialization Gather together upstream regulatory regions

13 BioProspector Initialization a1a1 a2a2 a3a3 a4a4 akak Actual Location of Regulatory Motifs is Unknown

14 BioProspector Initialization Initial Motif Randomly initialize the beginning motif a3'a3' a4'a4' ak'ak' a2'a2' a1'a1'

15 a3'a3' a4'a4' ak'ak' a2'a2' Motif Without a 1 ' Segment a1'a1' BioProspector Iterative Update Take out one sequence at a time with its segment

16 a3'a3' a4'a4' ak'ak' a2'a2' Motif Without a 1 ' Segment Segment (1-6): 1.5 Sequence 1 BioProspector Iterative Update Score each segment with the current motif

17 a3'a3' a4'a4' ak'ak' a2'a2' Motif Without a 1 ' Segment Segment (2-7): 3 Sequence 1 BioProspector Iterative Update Score each segment with the current motif

18 a3'a3' a4'a4' ak'ak' a2'a2' Motif Without a 1 ' Segment Segment (3-8): 2.7 Sequence 1 BioProspector Iterative Update Score each segment with the current motif

19 a3'a3' a4'a4' ak'ak' a2'a2' Motif Without a 1 ' Segment Segment (4-9): 9.0 Sequence 1 BioProspector Iterative Update Score each segment with the current motif

20 a3'a3' a4'a4' ak'ak' a2'a2' Motif Without a 1 ' Segment Segment (5-10): 3.2 Sequence 1 BioProspector Iterative Update Score each segment with the current motif

21 a3'a3' a4'a4' ak'ak' a2'a2' Motif Without a 1 ' Segment Segment (6-11): 27.1 Sequence 1 BioProspector Iterative Update Score each segment with the current motif

22 a3'a3' a4'a4' ak'ak' a2'a2' Motif Without a 1 ' Segment Segment (7-12): 11.2 Sequence 1 BioProspector Iterative Update Score each segment with the current motif

23 a3'a3' a4'a4' ak'ak' a2'a2' Motif Without a 1 ' Segment Segment (8-13): 2.9 Sequence 1 BioProspector Iterative Update Score each segment with the current motif

24 a3'a3' a4'a4' ak'ak' a2'a2' Motif Without a 1 ' Segment Segment (9-14): 9.1 Sequence 1 BioProspector Iterative Update Score each segment with the current motif

25 a3'a3' a4'a4' ak'ak' a2'a2' Motif Without a 1 ' Segment a1"a1" Candidate Motif BioProspector Iterative Update Score sequence 1 in all possible alignments

26 a3'a3' a4'a4' ak'ak' a2'a2' a1"a1" BioProspector Iterative Update Repeat the process until convergence Motif Without a 2 ' Segment

27 Challenges for BioProspector Variable (0-n) motif sites per sequence Motif enriched only in upstream sequences, not in the whole genome Some motifs could have two conserved blocks separated by a variable length gap Motifs are not highly conserved (~50%) Some motifs show a palindromic symmetry Assign motifs a measure of statistical significance

28 Thresholds Allow for Variable Motif Copies Sequences that do not have the motif Sequences with multiple copies of motif THTH TLTL Sample Discard Keep

29 BioProspector Finds Motif With Two Blocks Two-block motifs: GACACATTACCTATGC TGGCCCTACGACCTCTCGC CACAATTACCACCA TGGCGTGATCTCAGACACGGACGGC GCCTCGATTACCGTGGTA TGGCTAGTTCTCAAACCTGACTAAA TCTCGTTAGATTACCACCCA TGGCCGTATCGAGAGCG CGCTAGCCATTACCGAT TGGCGTTCTCGAGAATTGCCTAT

30 BioProspector Finds Motifs With Two Blocks Two-block motifs Sequence Min Gap Max Gap blk1block Sample Block2 start

31 BioProspector Finds Motif With Inverse Complementary Blocks Two-block motifs Palindrome motifs: AATGCG GCGTAA

32 B. subtilis transcription best studied 136 σ A -dependent promoter sequences [-100, 15] Look for w 1 = w 2 = 5, gap[15, 20] two-block motif Correctly identified motif [TTGACA, TATAAT] and 70% of all the sites Occasionally predicted two promoters Correct site Second site abrB TTGACGTACAAT veg TTGACATATAAT f105 TTTACATACAAT BioProspector Results: B. subtilis two-block promoter

33 BioProspector Web Server:

34 BioProspector Web Server:

35 Compare Prospector Liu et al, 2004, Genome Res 14(3):

36 Compare Prospector 1 kb Liu et al, 2004, Genome Res 14(3): Regions conserved between two species Motif

37 Compare Prospector Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene n Biased sampling: Initial iterations: T ch Later iterations: T cl T ch T cl Liu et al, 2004, Genome Res 14(3): 451-8,

38 Compare Prospector (Liu Y et al, Nucleic Acids Res 32:W204-7)

39 Compare Prospector (Liu Y et al, Nucleic Acids Res 32:W204-7)

40 Yeast Rap1 Sequences Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment

41 Cross link protein- DNA interaction Yeast Rap1 Sequences Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment

42 Cross link protein-DNA interaction Shear DNA Yeast Rap1 Sequences Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment

43 Immunoprecipitation Yeast Rap1 Sequences Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment

44 PCR amplify and label DNA Yeast Rap1 Sequences Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment

45 Hybridize with microarray and measure reading Yeast Rap1 Sequences Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment

46 Cross link protein- DNA interaction Shear DNAImmunoprecipitation Purify DNA PCR amplify and label DNA Hybridize with microarray and measure reading Yeast Rap1 Sequences Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment

47 Chromatin Immune Precipitation

48 Yeast Rap1 Sequences Chromatin immunoprecipitation + microarray (ChIP-on-chip, ChIP-array, IP) experiment Rap1 IP Enriched 727 DNA fragments o 45% are intergenic o Average length 1-2 KB o Some are false positives o Some have multiple Rap1 sites

49 Useful Insights In ChIP-array experiments, highly enriched sequences are usually the real targets Transcription factor binding sites occurs more abundantly in these real targets Search TF sites from high-confidence sequences first before examine the rest sequences? Motif Discovery Scan (MDscan)

50 MDscan Algorithm: Define m-matches For a given w-mer and any other random w-mer TGTAACGT8-mer TGTAACGTmatched 8 AGTAACGTmatched 7 TGCAACATmatched 6 TGACACGGmatched 5 AATAACAGmatched 4 m-matches for an 8-mer Pick a reasonable m, e.g. in yeast

51 MDscan Algorithm: Finding candidate motifs Top Seqs Seed 1 All IP enriched sequences m-matches

52 MDscan Algorithm: Finding candidate motifs Top Seqs Seed 2 All IP enriched sequences m-matches

53 MDscan Algorithm: Finding candidate motifs Top Seqs Seed 3 All IP enriched sequences m-matches

54 MDscan Algorithm: Scanning sequences with top motifs Keep top scoring candidate motifs: Motif Signal Abundance Conserved Positions Specificity (unlikely in genome)

55 MDscan Algorithm: Scanning sequences with top motifs Keep top scoring candidate motifs: Scan the rest of the sequences with the candidate motifs Motif Signal Abundance Conserved Positions Specificity (unlikely in genome)

56 MDscan Algorithm: Finding All Motif Instances Top Seqs Seed 3 All IP enriched sequences m-matches

57 MDscan Algorithm: Refine the motifs Top Seqs Seed 3 All IP enriched sequences m-matches X X X

58 MDscan Simulation Nine motif matrix models with 3 widths and 3 degeneracy GACTCCCA GATTGCCT GGCTACCT GACTACCA GAGTACCA GACTATCT GAGTACCA GGCTCCCA GACTCCCA W8S1 More Conserved W8S3 Less Conserved GACTCCGA GGGAACCA GCTTCCAA GACTACCA CAGTACGA GGCTAGCA GACTGCCG GACTACCA GACTCCCG

59 MDscan Simulation Each test set: 100 sequences of 600 bases from yeast intergenic Motif segments generated and inserted according to the following abundance: Higher confidence Motif more abundant

60 MDscan Simulation 100 tests for 3 widths 3 strengths 4 abundances 3600 tests

61 MDscan Simulation 100 tests for 3 widths 3 degeneracy 4 abundance 3 X Consensus MDscan speed14 X BioProspector 27 XAlignACE 3600 tests

62 MDscan Simulation Accuracy w = 8

63 MDscan Simulation Accuracy w = 12

64 MDscan Simulation Accuracy w = 16

65 MDscan Biological Tests Gal4 & Ste12 [Ren et al. Science 2000] o Gal4: galactose metabolism o Ste12: responds to mating pheromones

66 MDscan Biological Tests SBF & MBF [Iyer et al. Nature 2001] o SBF: Swi4 + Swi6 budding, membrane, cell wall biosynthesis o MBF: Mbp1 + Swi6 DNA replication and repair

67 MDscan Biological Tests Rap1 [Lieb et al. Nature Genetics 2001] o Repressor activator o 37% pol II events in exponentially growing cells

68 TAMO: Tools for the Analysis of Motifs

69 WebMotifs

70 WebMotifs

71 Melina: Comparing Motifs

72 Melina: Comparing Motifs

73 Single Microarray Determination of Transcription Factor Motifs One microarray experiment, no clustering needed Basic idea: more affected sequences may contain more motif TF sites Expression log ratio Genes Induced Repressed

74 Summary BioProspector is stochastic BioProspector can get trapped in local maxima BioProspector must be run multiple times to discover the true globally optimal motif BioProspector is slow MDScan is deterministic MDScan always gives the same answer with the same data MDScan is fast MDScan uses rank order data to accelerate the search process and to allow it to be deterministic MDScan is fast enough to search intergenic regions from entire genomes. MDScan is not as sensitive as BioProspector


Download ppt "Computational Molecular Biology Biochem 218 – BioMedical Informatics 231 Discovering Transcription."

Similar presentations


Ads by Google