Transcription as a Permutation Algorithm By M. Nickenig Mentor: Prof. Robert Vellanoweth.



Transcription: Overview
Where does transcription fit into the larger genetic schema? DNA > DNA > mRNA > Protein
Regulation at the transcriptional level involves four major modes:
1) Regulation via the combinatorics of the components of the basal transcription machinery.
2) Induction of response elements through inducible transcription factors (TFs).
3) Regulation through the action of interfering RNAs.
4) Chromatin remodeling.

Transcription: Regulation by TFs
Transcription factor: a protein that binds DNA at a specific promoter or enhancer site, where it regulates transcription. Basal transcription factors are involved in the formation of a pre-initiation complex.
(Source: Biochemistry, C. Matthews, K.E. van Holde, K.G. Ahern, 3rd ed.)

Transcription Factors: Structure
Regulatory factors activate or repress transcription. Note: not all transcription factors bind DNA; some only bind other transcription factors.
Basic structural features of transcription factors:
1) Activation domain. Three different types: acidic domain, glutamine-rich domain, proline-rich domain. Activation domains interact with the basal machinery to activate transcription.

Transcription Factors: Structure
2) DNA binding domain.
- Helix-turn-helix (HTH): two anti-parallel alpha-helical regions interrupted by a turn region; binds the major groove of the DNA.
- Zinc fingers: function as structural platforms for DNA binding. Transcription factors of this type have an absolute requirement for zinc for their formation. Two types: 2-His, 2-Cys Zn finger and multi-Cys Zn finger.

Transcription Factors: Structure
B-Zip: these factors possess a basic DNA binding domain (B-domain) adjacent to a leucine zipper dimerization domain, and they function as dimers. Leucine zippers associate the transcription factors with each other. The leucine zipper dimerization domain is found in many transcription factors.
[Figure label: Basic Domain]

Transcription Factors: Regulation
Consider the following example of transcriptional regulation by an inducible TF:

Transcription Factors: Regulation
Consider an additional example of transcriptional regulation by an inducible TF:

Transcription: Probabilities
Goal: We want to find all transcription factor binding sequences in the Arabidopsis thaliana transcriptome using a suitable motif-finding program.
Assumptions: Functionally related DNA sequences are generally expected to share some common sequence elements. The pattern shared by a set of functionally related sequences is commonly identified by aligning the sequences to maximize sequence conservation. A good alignment is assumed to be one whose alignment matrix is rarely expected to occur by chance. Furthermore, we assume that the letters at each position are drawn independently from the same background distribution. Thus, the probability of an alignment matrix is determined by the multinomial distribution (Reference: Bioinformatics Vol. 15, 1999).
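The multinomial probability referred to here can be written out as below. This is a reconstruction from the standard multinomial form and the symbols used in this presentation (A letters with a priori probabilities p_i, L columns, N sequences, and n_ij occurrences of letter i at position j); it is offered as a sketch and may differ in layout from the equation shown on the original slide.

```latex
P_{\text{matrix}} \;=\; \prod_{j=1}^{L} \left[ \frac{N!}{\prod_{i=1}^{A} n_{ij}!} \; \prod_{i=1}^{A} p_i^{\,n_{ij}} \right]
```

Each bracketed factor is the multinomial probability of observing one column of the alignment matrix; the product over j reflects the assumption that columns are independent.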

Transcription: Probabilities
Mathematical terms: i refers to the rows of the alignment matrix (i.e. the bases A, C, G, T); j refers to the columns of the matrix (i.e. the positions within the alignment pattern); A is the total number of letters in the sequence alphabet; L is the total number of columns in the matrix; p_i is the a priori probability of the letter i; n_ij is the number of occurrences of the letter i at position j; and N is the total number of sequences in the alignment (Reference: Bioinformatics Vol. 15, 1999).
Furthermore, the multinomial formula can be extended to calculate the probabilities associated with cis-regulatory modules, such that the sum is taken over all sequences in a module (L_all) and the factor (1/m) is a normalization constant, where m equals the number of sequences of length L comprising the module.
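As an illustration of how these terms combine, the following minimal sketch computes the multinomial probability of an ungapped alignment matrix in log space. The function names, the toy sequences, and the uniform background are assumptions made for the example; they are not taken from the presentation or from any motif-finding program.

```python
import math

def count_columns(sequences, alphabet="ACGT"):
    """Build the alignment matrix: one dict of letter counts (n_ij) per column."""
    length = len(sequences[0])
    return [{a: sum(s[j] == a for s in sequences) for a in alphabet}
            for j in range(length)]

def log_matrix_probability(columns, background):
    """Natural log of the multinomial probability of an alignment matrix.

    columns: list of dicts, one per column j, mapping letter i -> n_ij.
    background: dict mapping letter i -> a priori probability p_i.
    """
    log_p = 0.0
    for col in columns:
        n_total = sum(col.values())            # N, the number of sequences
        log_p += math.lgamma(n_total + 1)      # + ln N!
        for letter, n in col.items():
            log_p -= math.lgamma(n + 1)        # - ln n_ij!
            log_p += n * math.log(background[letter])  # + n_ij ln p_i
    return log_p

# Toy data (hypothetical): three aligned 4-mers and a uniform background.
toy_sequences = ["ACGT", "ACGA", "ACGT"]
uniform = {a: 0.25 for a in "ACGT"}
columns = count_columns(toy_sequences)
log_p = log_matrix_probability(columns, uniform)
```

Working in log space avoids underflow, which matters for values as small as those reported for whole modules.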

Transcription: Probabilities
cis-Regulatory module: a set of motifs that bind transcription factors cooperatively. For example, consider the following Cistematic-derived sequence data, which corresponds to the Lipid Transfer Protein (LTP) module (Thesis: A. Mortazavi, 2004, Vellanoweth Lab, CSULA):
First we calculate the probability associated with this alignment using the method of Hertz and Stormo; this is followed by a calculation in which the aligned sequences are broken up into blocks and each block is treated as a mutually exclusive event.

Transcription: Probabilities
An alignment matrix can be formed from a gapped alignment and the probability subsequently calculated, e.g. the T-Coffee-derived gapped alignment of the LTP module (Thesis: A. Mortazavi, 2004, Vellanoweth Lab):

Transcription: Probabilities
Sample Calculation: LTP module regions 1 and 3 (method of Hertz and Stormo; Bioinformatics Vol. 15, 1999)

Transcription: Probabilities
Sample Calculation: LTP module region 2 (as mutually exclusive events)

Transcription: Probabilities
Table 1: Probabilities, LTP Module (Hertz/Stormo method; reference: Bioinformatics Vol. 15, 1999)
Note: Calculations adjusted according to a background model based on Arabidopsis genome base frequencies (A: T: G: C:).
Width | Region | P matrix (Hertz/Stormo) | P matrix (Mutually Exclusive)
[per-region rows garbled: E E E E-18]
71 | Module | ?1.35E-97 | ?2.22E-14

Transcription: P-Value
Once the probabilities have been calculated, statements about statistical significance can be formulated by estimating the P-value using large-deviation statistics. Hertz and Stormo provide a statistical analysis method based upon the observation that when “the information content is small and the number of sequences is large, 2NI tends to a chi-squared distribution…” with L(A-1) degrees of freedom. In particular, the probability of a sequence alignment containing gaps is:
where n_-j is the number of occurrences of a gap at position j in the alignment; N, L, A and n_ij have been defined previously.
(Source: Proceedings of the 3rd International Conference on Bioinformatics and Genome Research, G.Z. Hertz and G. Stormo)

Transcription: P-Value
Then the information content (large-deviation rate function) of the corresponding sequence alignment is:
where f_ij = n_ij/N. “To calculate the overall statistical significance, we consider the probability distribution of and its large-deviation rate function of ”
(Source: Proceedings of the 3rd International Conference on Bioinformatics and Genome Research, G.Z. Hertz and G. Stormo)
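To make the quantities concrete, the sketch below computes the information content of an ungapped alignment as the relative entropy I = sum over i, j of f_ij ln(f_ij / p_i), together with the statistic 2NI and the L(A-1) degrees of freedom mentioned earlier. The ungapped form is an assumption made here for illustration; the gapped expression from the original slide is not reproduced, and the counts and background are hypothetical toy values.

```python
import math

def information_content(columns, background):
    """Relative entropy of an ungapped alignment, in nats.

    columns: list of dicts, one per column j, mapping letter i -> n_ij.
    background: dict mapping letter i -> a priori probability p_i.
    """
    total = 0.0
    for col in columns:
        n_seqs = sum(col.values())      # N
        for letter, n in col.items():
            if n == 0:
                continue                # the term 0 * ln 0 is taken as 0
            f = n / n_seqs              # f_ij = n_ij / N
            total += f * math.log(f / background[letter])
    return total

# Toy alignment matrix (hypothetical): N = 3 sequences, L = 3 columns.
columns = [
    {"A": 3, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 3, "G": 0, "T": 0},
    {"A": 1, "C": 0, "G": 0, "T": 2},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
n_sequences = 3

info = information_content(columns, background)
statistic = 2 * n_sequences * info                          # 2NI
degrees_of_freedom = len(columns) * (len(background) - 1)   # L(A-1)
```

With only three sequences the chi-squared approximation would not be expected to hold; the toy numbers only show how the terms are assembled.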

Transcription: P-Value
The overall statistical significance… is equal to the inverse of the product of 2 NL and the probability of a large-deviation rate function greater than or equal to (I gap matrix + L ln 2), based on the probability distribution, P, above.
Sample Calculation: P-value, LTP module region 1 (based on the method of Hertz/Stormo)
(Source: Proceedings of the 3rd International Conference on Bioinformatics and Genome Research, G.Z. Hertz and G. Stormo)

Transcription: Probabilities
Table 1: Probabilities, LTP Module (Hertz/Stormo method; reference: Bioinformatics Vol. 15, 1999)
Note: Calculations adjusted according to a background model based on Arabidopsis genome base frequencies (A: T: G: C:).
Width | Region | Prob matrix (Hertz/Stormo) | Prob matrix (Mutually Exclusive) | P-Value
[per-region rows garbled: E E E E E E E-11]
71 | Module | ?1.35E-97 | ?2.22E-14 | E-66

Transcription: Permutation
Furthermore, we want to devise a method for arriving at groupings of coregulated genes. These coexpressed gene clusters are expected to respond to internal or external stimuli, a response that can be visualized, as a first approximation, in a microarray. This concerted genetic response is presumed to be governed by a conserved set of response elements interacting with a distinct set of transcription factors. By focusing on gene clustering, we expect to detect transcription factor binding sites using the motif-finding program Cistematic, augmented with a statistical method described below.

Transcription: Permutation
Statistical Method: It occurred to the author that a simple plot of occurrences against probabilities would make trends in the data output by Cistematic visible.
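One way such a plot could be prepared is sketched below: motif occurrences are summed within bins of log10(probability), giving a table that can be passed to any plotting tool. The record layout (motif, occurrences, probability), the bin width, and the example values are all hypothetical; this does not reflect the actual Cistematic output format or the author's exact procedure.

```python
import math
from collections import Counter

def occurrences_by_probability(records, bin_width=1.0):
    """Sum motif occurrences within bins of log10(probability).

    records: iterable of (motif, occurrences, probability) tuples.
    Returns a dict mapping the lower edge of each bin to total occurrences,
    ordered from the smallest probabilities to the largest.
    """
    totals = Counter()
    for _motif, occurrences, probability in records:
        bin_floor = math.floor(math.log10(probability) / bin_width) * bin_width
        totals[bin_floor] += occurrences
    return dict(sorted(totals.items()))

# Hypothetical 15-mer records, for illustration only.
records = [
    ("ACGTACGTACGTACG", 4, 2e-12),
    ("TTGACCTTGACCTTG", 2, 3e-12),
    ("CCATGGCCATGGCCA", 7, 5e-9),
]
binned = occurrences_by_probability(records)
```

A skew in the resulting distribution, such as the one noted for the 15-mer data, would show up as bins at very low probability holding a disproportionate share of occurrences.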

Transcription: Permutation
Microarray 17808T7 was designed to identify gene expression changes that occur during shoot development in Arabidopsis. Root explants were incubated on a callus induction medium (CIM), during which time they acquired 'competence' to respond to hormones that induce shoot formation. Explants were then transferred to cytokinin-rich shoot induction medium (SIM), where they organized meristems and underwent shoot morphogenesis.
[Figure labels: Shoot Development Scan 1; Shoot Development Scan 2; Vascular Development; Shoot Development Scan 3; Shoot Development Scan; Vascular Development 2; Shoot Development in tissue culture 1; Shoot Development in tissue culture 2]

Transcription: Genes
Gene | Gene Info
AT2G22430 | Homeobox-leucine zipper protein 6 (HB-6) / HD-ZIP transcription factor 6, identical to homeobox-leucine zipper protein ATHB-6 (HD-ZIP protein ATHB-6)
AT5G01870 | Lipid transfer protein, putative, similar to lipid transfer protein 6 from Arabidopsis thaliana (gi: ); contains Pfam protease inhibitor/seed storage/LTP family domain PF00234
AT5G59330 | Hypothetical protein
AT5G59310 | Lipid transfer protein 4 (LTP4), identical to lipid transfer protein 4 from Arabidopsis thaliana (gi: ); contains Pfam protease inhibitor/seed storage/LTP family domain PF00234
AT5G59320 | Lipid transfer protein 3 (LTP3), identical to lipid transfer protein 3 from Arabidopsis thaliana (gi: ); contains Pfam protease inhibitor/seed storage/LTP family domain PF00234
AT1G50570 | C2 domain-containing protein, low similarity to cold-regulated gene SRC2 (Glycine max) GI: ; contains Pfam profile PF00168: C2 domain
AT2G05380 | Glycine-rich protein (GRP3S), identical to cDNA glycine-rich protein 3 short isoform (GRP3S) GI:
AT2G38540 | Nonspecific lipid transfer protein 1 (LTP1), identical to SP|Q42589

Transcription: Permutation
Typical Cistematic output (1-mismatch):

Transcription: Permutation
Here we have a plot of occurrences versus probabilities for the 15-mer data derived from microarray 17808T7. Notice the definite skew in the 15-mer Motifs (X20) graph.

Transcription: Permutation Results
The following motifs have been found thus far (column boundaries between motifs lost): YTCAYAYCMARYARCCAWCAYCWCSCRCTTCCATMYRAATCCCT
Gene | Motifs present (one X per motif)
AT5G59310 | XXX
AT5G59320 | XXX
AT5G59330 | XXX
AT2G05380 | XX
AT2G38540 | XX
AT1G50570 | XXX
AT2G22430 | XX
AT5G01870 | XXX
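The motifs above are written with IUPAC degenerate nucleotide codes (for example Y = C or T, R = A or G, M = A or C, W = A or T, S = C or G). As a minimal sketch of how such a consensus could be checked against a promoter sequence, the code below expands a degenerate motif into a regular expression and reports match positions on the forward strand. The motif fragment used in the example is simply the first eight letters of the string above, chosen for illustration without implying where the real motif boundaries lie, and the test sequence is made up.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Translate a degenerate consensus string into a regex pattern string."""
    return "".join(IUPAC[ch] for ch in motif.upper())

def find_motif(motif, sequence):
    """Return 0-based start positions of all matches, including overlapping ones.

    Only the forward strand is scanned; the reverse complement is not checked.
    """
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

# Hypothetical sequence containing two instances of the illustrative fragment.
hits = find_motif("YTCAYAYC", "GGCTCACATCAATTCATACCGG")
```

The lookahead form `(?=...)` is used so that overlapping occurrences are all reported rather than only the first of each overlapping group.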

Acknowledgements Prof. Robert Vellanoweth CSULA