Transcription as a Permutation Algorithm By M. Nickenig Mentor: Prof. Robert Vellanoweth
Transcription: Overview Where does transcription fit into the larger genetic schema? DNA --(replication)--> DNA --(transcription)--> mRNA --(translation)--> Protein
Specifically, regulation at the transcriptional level involves four major modes:
1) Regulation via the combinatorics of the components of the basal transcription machinery.
2) Induction of response elements through inducible transcription factors (TFs).
3) Regulation through the action of interfering RNAs.
4) Chromatin remodeling.
Biochemistry, C. Matthews, K.E. van Holde, K.G. Ahern, 3rd ed. Transcription: Regulation by TFs Transcription factor: a protein that binds DNA at a specific promoter or enhancer site, where it regulates transcription. Basal transcription factors are involved in the formation of a pre-initiation complex.
http://homepages.strath.ac.uk/~dfs97113/BB310/Lect15/lect15.htm Transcription Factors: Structure Regulatory factors activate or repress transcription. Note: not all transcription factors bind DNA; some bind only other transcription factors. Basic structural features of transcription factors: 1) Activation domain. Three different types: acidic domain, glutamine-rich domain, proline-rich domain. Activation domains interact with the basal machinery to activate transcription.
http://homepages.strath.ac.uk/~dfs97113/BB310/Lect15/lect15.htm Transcription Factors: Structure 2) DNA-binding domain. Helix-turn-helix (HTH) motifs bind the major groove of the DNA and consist of two anti-parallel alpha-helical regions interrupted by a turn region. Zinc fingers function as structural platforms for DNA binding. Transcription factors of this type have an absolute requirement for zinc for their formation. Two types: the 2-His, 2-Cys Zn finger and the multi-Cys Zn finger.
http://homepages.strath.ac.uk/~dfs97113/BB310/Lect15/lect15.htm Transcription Factors: Structure B-Zip: Leucine zippers function in associating transcription factors with each other. B-Zip factors possess a basic DNA-binding domain (B-domain) adjacent to a leucine zipper dimerization domain, and they function as dimers. The leucine zipper dimerization domain is found in many transcription factors. (Figure label: Basic Domain)
http://homepages.strath.ac.uk/~dfs97113/BB310/Lect15/lect15.htm Transcription Factors: Regulation Consider the following example of transcriptional regulation by an inducible TF:
Transcription Factors: Regulation Consider an additional example of transcriptional regulation by an inducible TF:
Bioinformatics Vol. 15, 1999, 563-577 Transcription: Probabilities
Goal: We want to find all transcription factor binding sequences in the Arabidopsis thaliana transcriptome using a suitable motif-finding program.
Assumptions: Functionally related DNA sequences are generally expected to share some common sequence elements. The pattern shared by a set of functionally related sequences is commonly identified while aligning the sequences to maximize sequence conservation. A good alignment is assumed to be one whose alignment matrix would rarely be expected to occur by chance. Furthermore, we assume that the letters in the alignment are independent and randomly distributed. Thus, the probability of an alignment matrix is determined by the multinomial distribution:
Transcription: Probabilities Mathematical terms: where i refers to the rows of the alignment matrix (i.e. the bases A, C, G, T); j refers to the columns of the matrix (i.e. the positions within the alignment pattern); A is the total number of letters in the sequence alphabet; L is the total number of columns in the matrix; p_i is the a priori probability of the letter i; n_ij is the number of occurrences of the letter i at position j; and N is the total number of sequences in the alignment (Reference: Bioinformatics Vol. 15, 1999, 563-577). Furthermore, the above formula can be extended to calculate the probabilities associated with cis-regulatory modules: such that the sum is taken over all sequences in a module (L_all), and the factor (1/m) is a normalization constant, where m equals the number of sequences of length L comprising the module.
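A minimal sketch of the alignment-matrix probability, assuming the standard multinomial form in the symbols defined above: for each column j, the coefficient N! / (n_Aj! n_Cj! n_Gj! n_Tj!) multiplied by the product of p_i raised to n_ij, with the column terms multiplied over all L columns. The function name `log_p_matrix` and the toy alignment are illustrative assumptions, not part of the original analysis; the background frequencies are the ones quoted on the Table 1 slides. The calculation is done in log space because the probabilities are far too small to multiply safely for wide alignments.

```python
import math
from collections import Counter

# Background base frequencies as quoted on the Table 1 slides.
BACKGROUND = {"A": 0.3180185, "T": 0.318015, "G": 0.1819815, "C": 0.1819815}


def log_p_matrix(sequences, background=BACKGROUND):
    """Natural log of the multinomial probability of an ungapped alignment.

    sequences: equal-length strings, one per aligned sequence (N rows, L columns).
    """
    n_seqs = len(sequences)
    width = len(sequences[0])
    if any(len(seq) != width for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")

    log_p = 0.0
    for j in range(width):
        counts = Counter(seq[j] for seq in sequences)  # n_ij for column j
        log_p += math.lgamma(n_seqs + 1)               # log N!
        for base, n_ij in counts.items():
            log_p -= math.lgamma(n_ij + 1)             # minus log n_ij!
            log_p += n_ij * math.log(background[base])  # plus n_ij * log p_i
    return log_p


# Toy alignment (hypothetical, not the LTP data).
toy = ["ACGT", "ACGA", "ACGT"]
print(f"log10 P_matrix = {log_p_matrix(toy) / math.log(10):.3f}")
```

Gapped alignments and the module-level extension are not handled here; both would need the expressions from the cited paper.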
Thesis: A. Mortazavi, 2004 Vellanoweth Lab, CSULA Transcription: Probabilities cis-Regulatory module: a set of motifs that bind transcription factors cooperatively. For example, consider the following Cistematic-derived sequence data, which correspond to the Lipid Transfer Protein (LTP) module (Thesis: A. Mortazavi, 2004 Vellanoweth Lab): First we calculate the probability associated with this alignment using the method of Hertz and Stormo; this is followed by a calculation in which the aligned sequences are broken up into blocks and each block is treated as a mutually exclusive event.
Thesis: A. Mortazavi, 2004 Vellanoweth Lab Transcription: Probabilities An alignment matrix can be formed from a gap alignment and the probability subsequently calculated, e.g. the T-COFFEE-derived gap alignment of the LTP module:
Bioinformatics Vol. 15, 1999, 563-577 Transcription: Probabilities Sample Calculation: LTP module regions 1 and 3 (Method of Hertz and Stormo)
Transcription: Probabilities
Table 1: Probabilities - LTP Module (Hertz/Stormo method; reference: Bioinformatics Vol. 15, 1999, 563-577)
Note: Calculations adjusted according to a background model based on Arabidopsis genome base frequencies: A: 0.3180185, T: 0.318015, G: 0.1819815, C: 0.1819815
Width   Region   P matrix (Hertz/Stormo)   P matrix (Mutually Exclusive)
7       1        4.39E-14
40      2        1.23E-66                  4.22E-16
10      3        2.51E-18
71      Module   ~1.35E-97                 ~2.22E-14
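As a quick arithmetic check, the module-level Hertz/Stormo entry in Table 1 appears consistent with multiplying the three region probabilities, which is what one would expect if the regions are treated as independent. The sketch below only reuses the printed table values; the names `REGION_P` and `module_log10_probability` are illustrative, and the independence reading is an assumption rather than something the slides state.

```python
import math

# Region probabilities as printed in Table 1 (Hertz/Stormo column).
REGION_P = {1: 4.39e-14, 2: 1.23e-66, 3: 2.51e-18}


def module_log10_probability(region_probs):
    """Sum the log10 values, which is equivalent to multiplying the probabilities."""
    return sum(math.log10(p) for p in region_probs.values())


log10_p = module_log10_probability(REGION_P)
exponent = math.floor(log10_p)          # power of ten of the product
mantissa = 10 ** (log10_p - exponent)   # leading digits of the product
print(f"module probability = {mantissa:.2f}E{exponent}")
```

The product comes out close to the 1.35E-97 listed for the module; any small difference in the last digit would be explained by rounding of the printed region values.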
Proceedings of the 3rd International Conference on Bioinformatics and Genome Research, G.Z. Hertz and G. Stormo Transcription: P-Value Probabilities From the calculated probabilities, statements about statistical significance can be formulated by estimating the P-value using large-deviation statistics. In particular, Hertz and Stormo provide a statistical analysis method based upon the observation that when “the information content is small and the number of sequences is large, 2NI tends to a chi-squared distribution…” with L(A-1) degrees of freedom. In particular, the probability of a sequence alignment containing gaps is: where n_-j is the number of occurrences of a gap at position j in the alignment. N, L, A and n_ij have been defined previously.
Proceedings of the 3rd International Conference on Bioinformatics and Genome Research, G.Z. Hertz and G. Stormo Transcription: P-Value Then the information content (large-deviation rate function) of the corresponding sequence alignment is: where f_ij = n_ij/N. “To calculate the overall statistical significance, we consider the probability distribution of and its large-deviation rate function of ”
Proceedings of the 3rd International Conference on Bioinformatics and Genome Research, G.Z. Hertz and G. Stormo Transcription: P-Value The overall statistical significance … is equal to the inverse of the product of 2 NL and the probability of a large-deviation rate function greater than or equal to (I_gap matrix + L ln 2) based on the probability distribution, P, above. Sample Calculation: P-Value LTP module region 1 (based on the method of Hertz/Stormo)
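A minimal sketch of the quantities that feed the chi-squared approximation, assuming the ungapped relative-entropy form of the information content, I = sum over columns j and letters i of f_ij ln(f_ij / p_i), with f_ij = n_ij/N. The gapped expression from the slide is not reproduced here, and the function names are illustrative. The sketch returns the statistic 2NI and the L(A-1) degrees of freedom; turning these into a tail probability would need a chi-squared survival function from a statistics library.

```python
import math
from collections import Counter


def information_content(sequences, background):
    """Relative entropy of an ungapped alignment against the background (in nats)."""
    n_seqs = len(sequences)
    width = len(sequences[0])
    total = 0.0
    for j in range(width):
        counts = Counter(seq[j] for seq in sequences)  # n_ij for column j
        for base, n_ij in counts.items():
            f_ij = n_ij / n_seqs
            # Letters with n_ij = 0 contribute nothing and are skipped by Counter.
            total += f_ij * math.log(f_ij / background[base])
    return total


def chi_squared_inputs(sequences, background):
    """Return (2NI, degrees of freedom) for the chi-squared approximation."""
    n_seqs = len(sequences)
    width = len(sequences[0])
    statistic = 2 * n_seqs * information_content(sequences, background)
    dof = width * (len(background) - 1)  # L(A-1)
    return statistic, dof


# Toy example with a uniform background (hypothetical data).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(chi_squared_inputs(["ACGT", "ACGA", "ACGT"], uniform))
```

Note that the approximation is quoted as valid only when the information content is small and the number of sequences is large, so a three-sequence toy alignment is for illustration only.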
Transcription: Probabilities
Table 1: Probabilities - LTP Module (Hertz/Stormo method; reference: Bioinformatics Vol. 15, 1999, 563-577)
Note: Calculations adjusted according to a background model based on Arabidopsis genome base frequencies: A: 0.3180185, T: 0.318015, G: 0.1819815, C: 0.1819815
Width   Region   Prob matrix (Hertz/Stormo)   Prob matrix (Mutually Exclusive)   P-Value
7       1        4.39E-14                                                        6.88E-07
40      2        1.23E-66                     4.22E-16                           1.01E-45
10      3        2.51E-18                                                        2.10E-11
71      Module   ~1.35E-97                    ~2.22E-14                          1.54E-66
Transcription: Permutation A further goal is to devise a method for arriving at groupings of genes that are coregulated. These coexpressed gene clusters are expected to respond to internal or external stimuli, which can be visualized, as a first approximation, in a microarray. This concerted genetic response is presumed to be governed by a conserved set of response elements interacting with a distinct set of transcription factors. By focusing on gene clustering we expect to detect transcription factor binding sites using the motif-finding program Cistematic, augmented with the statistical method described below.
Transcription: Permutation Statistical Method: A simple plot of occurrences against probabilities visualizes trends in the data output by Cistematic.
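One way such an occurrences-by-probability view could be sketched is to bin the motif probabilities by order of magnitude and count how many motifs fall in each bin. The input here is a hypothetical list of probabilities; the actual Cistematic output format is not shown in the slides, so `occurrences_by_decade` and `text_plot` are illustrative names and the binning scheme is an assumption.

```python
import math
from collections import Counter


def occurrences_by_decade(probabilities):
    """Count motifs per power-of-ten bin, using floor(log10 p) as the bin label."""
    return Counter(math.floor(math.log10(p)) for p in probabilities if p > 0)


def text_plot(counts):
    """Render one line per bin, most probable bin first, one '#' per occurrence."""
    return "\n".join(
        f"1E{bin_exp} | {'#' * counts[bin_exp]}"
        for bin_exp in sorted(counts, reverse=True)
    )


# Hypothetical motif probabilities, for illustration only.
example = [4.39e-14, 3.0e-14, 2.0e-16]
print(text_plot(occurrences_by_decade(example)))
```

A skew in such a plot would show up as the counts piling up toward one end of the probability range.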
www.arabidopsis.org Transcription: Permutation Microarray 17808T7 was designed to identify gene expression changes that occur during shoot development in Arabidopsis. Root explants were incubated on a callus induction medium (CIM), during which time they acquired 'competence' to respond to hormones that induce shoot formation. Explants were then transferred to cytokinin-rich shoot induction medium (SIM), where they organized meristems and underwent shoot morphogenesis.
Figure panels: Shoot Development Scan 1; Shoot Development Scan 2; Vascular Development; Shoot Development Scan 3; Shoot Development Scan; Vascular Development 2; Shoot Development in tissue culture 1; Shoot Development in tissue culture 2
www.arabidopsis.org Transcription: Genes
Gene: Gene Info
AT2G22430: Homeobox-leucine zipper protein 6 (HB-6) / HD-ZIP transcription factor 6, identical to homeobox-leucine zipper protein ATHB-6 (HD-ZIP protein ATHB-6)
AT5G01870: Lipid transfer protein, putative, similar to lipid transfer protein 6 from Arabidopsis thaliana (gi:8571927); contains Pfam protease inhibitor/seed storage/LTP family domain PF00234
AT5G59330: hypothetical protein
AT5G59310: Lipid transfer protein 4 (LTP4), identical to lipid transfer protein 4 from Arabidopsis thaliana (gi:8571923); contains Pfam protease inhibitor/seed storage/LTP family domain PF00234
AT5G59320: Lipid transfer protein 3 (LTP3), identical to lipid transfer protein 3 from Arabidopsis thaliana (gi:8571921); contains Pfam protease inhibitor/seed storage/LTP family domain PF00234
AT1G50570: C2 domain-containing protein, low similarity to cold-regulated gene SRC2 (Glycine max) GI:2055230; contains Pfam profile PF00168: C2 domain
AT2G05380: Glycine-rich protein (GRP3S), identical to cDNA glycine-rich protein 3 short isoform (GRP3S) GI:4206766
AT2G38540: Nonspecific lipid transfer protein 1 (LTP1), identical to SP|Q42589
Transcription: Permutation Here we have a plot of occurrences versus probabilities of the 15-mer data derived from microarray 17808T7. Notice the definite skew in the 15-mer Motifs(X20) graph.
Transcription: Permutation Results The following motifs have been found thus far: YTCAYAYCMARYARCCAWCAYCWCSCRCTTCCATMYRAATCCCT
AT5G59310 XXX
AT5G59320 XXX
AT5G59330 XXX
AT2G05380 XX
AT2G38540 XX
AT1G50570 XXX
AT2G22430 XX
AT5G01870 XXX
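The motifs above are written with IUPAC degenerate nucleotide codes (for example Y = C or T, R = A or G, M = A or C, W = A or T, S = C or G). A minimal sketch of how such a motif could be scanned against a sequence is to translate it into a regular expression. The function names, the short motif `YTCAY`, and the test sequence are hypothetical illustrations; this is not how Cistematic itself locates sites.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif string into a regex pattern string."""
    return "".join(IUPAC[code] for code in motif.upper())


def find_sites(motif, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # The zero-width lookahead lets overlapping occurrences be reported.
    pattern = re.compile(f"(?={iupac_to_regex(motif)})")
    return [match.start() for match in pattern.finditer(sequence.upper())]


# Hypothetical short motif and sequence, for illustration only.
print(find_sites("YTCAY", "GGCTCATTTTCACAA"))
```

Only the forward strand is scanned here; a fuller search would also check the reverse complement.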
Acknowledgements Prof. Robert Vellanoweth CSULA