1
cisGreedy Motif Finder for Cistematic. Sarah Aerni. Mentors: Ali Mortazavi, Barbara Wold

2
cisGreedy Algorithm similar to Consensus motif finder – Greedy method over multiple iterations – De novo motif finder based on input values Implemented in Cistematic package using Python Goal: To provide an efficient Greedy algorithm to be included in the Cistematic package that performs similarly to Consensus

3
Cistematic One motif finder is generally insufficient Further automated analysis performed to refine motifs Enhances motif finder performance through additional steps Image: Ali Mortazavi

4
Cistematic Image: Ali Mortazavi cisGreedy becomes part of “Bottom Tier” Offers an alternative to downloading Consensus software – Additional motif finders will be made available

5
What is a Motif? cis-Regulatory elements – Transcription Factor Binding Sites (TFBS) – Binding by transcription factors may increase or decrease transcription of genes Gene regulation is believed to be a major source of complexity – Plants may have more genes or larger genomes than humans – are they more complex?

6
Multiple Products from One Gene Other methods to increase complexity – Polyadenylation Different “endings” available – Alternative splicing Many more cDNAs – Methylation Identification of cis-regulatory elements will help us understand gene regulatory networks

7
Motif Finding in DNA Sequences
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
5 sample sequences

8
Motif Finding in DNA Sequences
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
But motifs are rarely conserved to such a degree

9
Motif Finding in DNA Sequences
cctgatagacgctatctggctatccaTgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacTtaGgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgAacgAgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacCtCcgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaaGgtGcgtc
Motifs are less discernible without 100% identity (capital letters mark substituted bases)
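To make the point about imperfect identity concrete, here is a minimal Python sketch of a mismatch-tolerant search. The helper names `hamming` and `best_match` are illustrative and not part of cisGreedy or Cistematic; the two test fragments are cut from the first sequence of the exact and the substituted versions of this slide.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_match(seq, pattern):
    """Return (mismatches, start) for the window of seq closest to pattern."""
    seq, pattern = seq.lower(), pattern.lower()
    k = len(pattern)
    # min() on tuples prefers fewer mismatches, then the leftmost start.
    return min((hamming(seq[i:i + k], pattern), i)
               for i in range(len(seq) - k + 1))


exact = "ctatccacgtacgtaggtcc"    # fragment with the conserved site
mutated = "ctatccaTgtacgtaggtcc"  # same fragment with one substitution

print(best_match(exact, "acgtacgt"))    # (0, 6)
print(best_match(mutated, "acgtacgt"))  # (1, 6)
```

An exact string search would find the first site and miss the second, which is why motif finders score alignments instead of matching strings.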

10
Motif Finding in DNA Sequences
cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttCcaaccat
agtactggtgtAcAtttGatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaAAAtttt
agcctccgatgtaagtcatagctgtaactattacctgccacCcCtAttacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
Other subsequences which are not motifs may appear more conserved – filtering out noise becomes challenging!

11
Motifs are degenerate Only certain positions need to be specified – Binding Sites for different control elements may overlap – more complex regulation Often use Position Specific Frequency Matrix (PSFM) where each nucleotide is represented as a fraction - columns add to 1 Also represented by “Motif Logo”

12
How do we find motifs? Hard to identify – Relatively short sequences – Many positions not well conserved Factors improving identification – Usually localized in certain proximity of a gene (search within 3 kb upstream) – Some positions highly conserved – Use other data (Microarray?)

13
Motif Finders Greedy – Maximizes similarity of motifs from sequences through a greedy approach Gibbs Sampling – Attempts to find the best motifs using a combination of probability and scores, to avoid settling on local maxima Expectation Maximization

14
Consensus Score Determine number of occurrences of each base at each position

22
Consensus Score Determine number of occurrences of each base at each position Sum of the occurrences of each nucleotide at every index must add to the total number of sequences included

23
Consensus Score Determine number of occurrences of each base at each position Identify the most common base at each position – Consensus Sequence

24
Consensus Score Determine number of occurrences of each base at each position Identify the most common base at each position – Consensus Sequence Add the occurrences of each consensus base at each index to determine the Consensus Score Consensus Score = 31
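The three steps above (count, pick the most common base, sum its counts) can be sketched in Python as follows. The function names are illustrative, and the input is an assumption: the five 7-mers are the ones listed on the Position Specific Frequency Matrix slide, which may or may not be the sites behind this slide's figure. With them the score does come out to 31.

```python
from collections import Counter


def count_matrix(sites):
    """Count each base at each position of equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(column) for column in zip(*sites)]


def consensus(sites):
    """Return (consensus sequence, consensus score) for aligned sites."""
    columns = count_matrix(sites)
    # most_common(1) yields the top (base, count) pair of each column;
    # ties are resolved in favour of the base seen first.
    top = [col.most_common(1)[0] for col in columns]
    sequence = "".join(base for base, _ in top)
    score = sum(count for _, count in top)
    return sequence, score


sites = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
print(consensus(sites))  # ('TGAGGGA', 31)
```

Each column's counts sum to the number of sites (5 here), so the maximum possible score for seven columns is 35.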

25
Position Specific Frequency Matrix
TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA
Position   1    2    3    4    5    6    7
A          0.0  0.0  0.6  0.0  0.4  0.0  1.0
C          0.0  0.0  0.0  0.0  0.0  0.0  0.0
G          0.0  1.0  0.4  1.0  0.6  1.0  0.0
T          1.0  0.0  0.0  0.0  0.0  0.0  0.0
Frequencies are the number of each base at every position divided by the total number of sequences Sum for each column is 1 (at least one base must occur)
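A short Python sketch of the frequency calculation described on this slide, using the five sites listed. The function name `psfm` and the dictionary layout are choices made for this example, not the Cistematic data structure.

```python
BASES = "ACGT"


def psfm(sites):
    """Map each base to its per-position frequency across aligned sites."""
    n = len(sites)
    # zip(*sites) walks the alignment column by column.
    return {base: [column.count(base) / n for column in zip(*sites)]
            for base in BASES}


sites = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
matrix = psfm(sites)
for base in BASES:
    print(base, " ".join(f"{f:.1f}" for f in matrix[base]))
```

Dividing counts by the number of sites is what makes every column sum to 1.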

26
Motif Logo
TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA
Position   1    2    3    4    5    6    7
A          0.0  0.0  0.6  0.0  0.4  0.0  1.0
C          0.0  0.0  0.0  0.0  0.0  0.0  0.0
G          0.0  1.0  0.4  1.0  0.6  1.0  0.0
T          1.0  0.0  0.0  0.0  0.0  0.0  0.0
(Logo image: bioalgorithms.info)
Frequencies affect logo size Size of letter indicates the frequency of occurrence relative to other sequences Size indicates confidence of letter

27
Consensus Scoring Use an equation similar to log likelihood, called Information Content (Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 15(7-8), 1999: 563-577). The equation is defined over the L columns in the matrix and the alphabet A = {A,C,G,T}, using the frequency of each letter i at each position j and the a priori probability of letter i. Our implementation substitutes the a priori probability with a specific dependent probability based on the Markov Model
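For reference, the information content measure from Hertz and Stormo is commonly written as below. The symbols f_{i,j} (frequency of letter i at position j) and p_i (a priori probability of letter i) are labels chosen here for the two quantities the slide lists, so treat this as a reconstruction from those definitions, not a copy of the slide's formula.

```latex
I = \sum_{j=1}^{L} \sum_{i \in A} f_{i,j} \, \ln \frac{f_{i,j}}{p_i}
```

Terms with f_{i,j} = 0 are taken as 0. With a uniform p_i = 0.25, a fully conserved column contributes ln 4 (about 1.39) and a column matching the background contributes 0.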

28
cisGreedy Input sequences are analyzed – possibly establish background – Background models are used to filter out noise Randomly select 2 sequences to be compared

29
cisGreedy The two selected sequences are independently analyzed

30
cisGreedy The two selected sequences are independently analyzed Windows of motif size are scanned starting at the beginning of each sequence

31
cisGreedy Sequences are scanned in an attempt to locate the highest scoring alignment – Alignments are ungapped – Score is the Information Content
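The pairwise scan can be sketched as an exhaustive comparison of every window in one sequence against every window in the other. This is a simplified illustration, not the cisGreedy code: it assumes a uniform background of 0.25 per base in place of the Markov-model probability mentioned earlier, and the names `information_content` and `best_pair` are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"


def information_content(sites, background=None):
    """Sum over columns and bases of f * ln(f / p); f = 0 terms add nothing."""
    background = background or {b: 0.25 for b in BASES}
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for base in BASES:
            f = column.count(base) / n
            if f > 0:
                total += f * math.log(f / background[base])
    return total


def best_pair(seq_a, seq_b, width):
    """Score every ungapped pair of width-long windows; keep the best one."""
    best = None
    for i, j in product(range(len(seq_a) - width + 1),
                        range(len(seq_b) - width + 1)):
        score = information_content([seq_a[i:i + width], seq_b[j:j + width]])
        if best is None or score > best[0]:
            best = (score, i, j)
    return best  # (score, start in seq_a, start in seq_b)


score, start_a, start_b = best_pair("GGTGAGGGACC", "ATTGAGGGATT", 7)
print(round(score, 3), start_a, start_b)
```

The cost grows with the product of the two sequence lengths, which is one reason the later steps only scan a single new sequence against windows that are already fixed.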

32
cisGreedy Reverse Complements are analyzed (unless specified otherwise) Once start locations are established with a top alignment score, these are left unchanged
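Scanning the reverse complement amounts to scoring each window twice, once per strand. A minimal Python sketch, with illustrative helper names that are not taken from Cistematic:

```python
# Translation table for complementing bases, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def windows_both_strands(seq, width):
    """Yield (start, strand, window) for each window on both strands."""
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        yield i, "+", window
        yield i, "-", reverse_complement(window)


print(reverse_complement("TGAGGGA"))  # TCCCTCA
print(list(windows_both_strands("ACGG", 3)))
```

Some sites, such as acgtacgt, are their own reverse complement, so both strands give the same window.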

33
cisGreedy Select an additional sequence in which to identify the location of the motif Windows of the additional sequence are aligned to the previously established windows (hence Greedy)

34
cisGreedy Additional sequence is scanned as before, including its reverse complement (unless otherwise specified) Alignment score established as before
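The greedy extension step can be sketched as follows: windows already chosen stay fixed, and each new sequence contributes the single window that maximizes the score of the growing alignment. This is a simplified illustration under stated assumptions (uniform background, forward strand only, hypothetical function names), not the cisGreedy implementation.

```python
import math

BASES = "ACGT"


def information_content(sites, p=0.25):
    """Information content of aligned sites against a uniform background."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        for base in BASES:
            f = column.count(base) / n
            if f > 0:
                total += f * math.log(f / p)
    return total


def extend_alignment(aligned_sites, new_seq, width):
    """Return (score, start) of the window of new_seq that fits best."""
    best = None
    for i in range(len(new_seq) - width + 1):
        score = information_content(aligned_sites + [new_seq[i:i + width]])
        if best is None or score > best[0]:
            best = (score, i)
    return best


def greedy_motif(seed_sites, remaining_seqs, width):
    """Add one site per remaining sequence; earlier choices are never revisited."""
    sites = list(seed_sites)
    for seq in remaining_seqs:
        _, start = extend_alignment(sites, seq, width)
        sites.append(seq[start:start + width])
    return sites


seeds = ["TGGGGGA", "TGAGAGA"]
others = ["CCTGGGGGACC", "AATGAGAGATT", "GGTGAGGGACA"]
print(greedy_motif(seeds, others, 7))
```

Because earlier windows are never revisited, the result depends on the starting pair and on the order of the remaining sequences, which is why repeating the procedure over several iterations helps.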

35
cisGreedy Final motif locations are taken in order to build position specific frequency matrices If the reverse complement of a window was selected, that reverse complement sequence is used in building the PSFM

36
cisGreedy User Input
– Sequence input
– Motif size (may be a range)
– Number of motifs cisGreedy should find
– Iterations to perform at each step before selecting a motif
– Background model
– Markov Model size
– Reverse complement: whether to include it
– May designate which sequences will be “founder” sequences (select homologs)
– Designate percent identity between founder sequences

37
cisGreedy Output Multiple motifs represented as PWMs or PSFMs Motifs represented as symbols. – Basic nucleotides represented by respective symbols (A - adenine, etc.) – Remaining symbols may require threshold
Nucleotides   Symbol
A C           M
A G           R
A T           W
C G           S
C T           Y
G T           K
A C G         V
A C T         H
A G T         D
C G T         B
A C G T       N
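A sketch of how a PSFM column might be turned into one of these symbols. The slide only says that a threshold may be required, so the rule used here (keep every base whose frequency is at or above a cutoff, default 0.25) is an assumption for illustration, as are the function names.

```python
# Degenerate nucleotide symbols keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H",
    frozenset("AGT"): "D", frozenset("CGT"): "B",
    frozenset("ACGT"): "N",
}


def column_symbol(freqs, threshold=0.25):
    """freqs maps base -> frequency; bases at or above threshold are kept."""
    kept = frozenset(b for b, f in freqs.items() if f >= threshold)
    # If no base reaches the threshold, fall back to N (any base).
    return IUPAC.get(kept, "N")


def psfm_to_symbols(columns, threshold=0.25):
    """Convert a list of per-position frequency dicts to a symbol string."""
    return "".join(column_symbol(c, threshold) for c in columns)


columns = [
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.6, "C": 0.0, "G": 0.4, "T": 0.0},
]
print(psfm_to_symbols(columns))       # TGR
print(psfm_to_symbols(columns, 0.5))  # TGA
```

Raising the threshold collapses degenerate positions to the dominant base, so the same matrix can print as different symbol strings.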

38
Symbol Example
Position   1    2    3    4   5   6   7   8   9    10   11
A          .75  .25  0    1   0   ?   0   1   0    0    1
C          0    0    .75  0   0   ?   0   0   .25  ?    0
G          .25  .75  .25  0   1   0   1   0   .75  ?    0
T          0    0    0    0   0   0   0   0   0    0    0
Symbols    R    R    S    A   G   M   G   A   S    S    A

39
cisGreedy - Optimization
– ZOOPS (Zero or One Occurrence per Sequence): if no good motif is identified in a sequence, the sequence is removed (if the subsequence’s P-value is not greater than the average P-value)
– Background model (default Markov3 model): can be supplied as input (e.g. C/G-rich regions); the Markov model can be up to Markov6 (unreasonable for input sequences of a certain size)
– Find multiple motifs: mask each motif after identification (windows cannot be reused)
– Allow for ranges of motif lengths
– Perform multiple iterations before choosing a motif, to avoid local maxima

40
cisGreedy Markov3 Background Model Collection of all 4-mers with the corresponding frequency of each word in the input sequences Use 4-mer frequencies to describe the P-value of the last nucleotide in the 4-mer – Nucleotide P-values are not independent Probability of any sequence is the product of the probabilities of the nucleotides that make up that sequence A word is deemed significant if its probability is less than the average of all words of the same size in the background model
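A minimal Python sketch of a third-order Markov background along the lines described: count 4-mers, convert them to the probability of the last base given the preceding 3-mer, and multiply (here, sum negative logs) along a word. Several details are assumptions made for the example and are not stated on the slide: the first three bases of a word are skipped because they lack a full context, unseen transitions get a small floor probability, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def train_markov(sequences, order=3):
    """Estimate P(next base | preceding `order` bases) from k-mer counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - order):
            counts[seq[i:i + order]][seq[i + order]] += 1
    model = {}
    for context, nexts in counts.items():
        total = sum(nexts.values())
        model[context] = {base: c / total for base, c in nexts.items()}
    return model


def neg_log_probability(word, model, order=3, floor=1e-6):
    """-ln of the product of conditional probabilities along `word`.

    Bases without a full context are skipped; transitions never seen in
    training use `floor` (both choices are assumptions of this sketch).
    """
    total = 0.0
    for i in range(order, len(word)):
        p = model.get(word[i - order:i], {}).get(word[i], floor)
        total -= math.log(p)
    return total


model = train_markov(["AAAAC"])
print(model)                               # {'AAA': {'A': 0.5, 'C': 0.5}}
print(neg_log_probability("AAAA", model))  # ln 2, about 0.693
```

Working with -ln values avoids underflow when many small probabilities are multiplied, and matches the -ln probability axis used in the plots later in the talk.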

41
cisGreedy Markov3 - Example Each word has a probability associated with it – Probability of seeing the word based on its frequency in the model: 5.3 × 10^-12

42
cisGreedy Markov3 - Example Each word has a probability associated with it – Probability of seeing the word based on its frequency in the model – Describes probability of seeing letter in the last position based on the 3-mer preceding it: 5.3 × 10^-8

43
Calculating probability of a word Calculation of word probability based on Markov Models [Plot: -ln probability of subsequence, plotted along the sequence]

44
1kb upstream region of a yeast gene: nucleotide distribution Distribution of nucleotide probabilities based on Markov Model [Plot: -ln probability of nucleotide against sequence position (upstream from transcription start site)]

45
1kb upstream region of a yeast gene: word probabilities -ln probabilities of all words based on Markov Model [Plot: -ln probability of sequence against sequence position (upstream from transcription start site)]

48
1kb upstream region of a yeast gene: motif probability -ln probabilities based on Markov Model of all words within 50 nucleotides of a known yeast motif [Plot: -ln probability of sequence against sequence position (upstream from transcription start site)]

49
Probability of a motif Probabilities of seeing a motif given a background should be lower – Chance of seeing the word at random should be low A motif will not have an extremely low probability as it should be seen multiple times in a data set for it to be identified

50
MSP Results – Testing using nematode data – C. elegans and C. briggsae – Major Sperm Protein (MSP) Cytoskeletal element required for motility of nematode spermatozoa Multiple genes in genomes Co-regulated

51
MSP Results – MEME motifs Motifs represented by symbols identified by MEME

52
MSP Results – cisGreedy motifs Motifs represented by symbols identified by cisGreedy

53
MSP Results – MEME motifs Motifs identified by MEME plotted on input sequences – Total 10 motifs identified (not all plotted)

54
MSP Results – cisGreedy motifs Motifs identified by cisGreedy plotted on input sequences – Total 10 motifs identified (not all plotted)

55
Future goals Test cisGreedy with the dataset used in the paper analyzing available motif finding tools Make adjustments to improve results Build upon cisGreedy to make more complex algorithms - Weeder? Additional motif finders based on different approaches: Gibbs Sampler, Expectation Maximization

56
References
Bioalgorithms.info
Jones, Neil C., and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press, 2004.
Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 15(7-8), 1999: 563-577.
Tompa, Martin, et al. "Assessing computational tools for the discovery of transcription factor binding sites." Nature Biotechnology, January 2005: 137-144.
http://cistematic.caltech.edu

57
Acknowledgements Ali Mortazavi Barbara Wold Wold Lab funding provided by DOE & NASA Additional funding by NSF & NIH SoCalBSI faculty, staff and fellow students
