Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute.



Why yeast is interesting to industry
• Easy to work with; (first) fully sequenced eukaryotic model organism
• 30% of genes have analogs in human
• Most known human disease genes have homologues in yeast
• Interesting in itself for the food industry

Genetic networks
[Figure: DNA carrying four promoter/gene pairs (promoter 1/gene 1 to promoter 4/gene 4); transcription to RNA, translation to proteins; transcription factors]

Mining the Yeast Expression Data
• The long-term goals:
  » reconstructing the gene regulation networks and relating them to metabolic pathways
• Short-term goals:
  » correlating gene expression profiles with gene functional classes and using this for prediction of gene functions
  » correlating gene expression profiles with promoter regions

Yeast microarray

Yeast gene expression during diauxic shift (DeRisi et al.)
Yeast cells from an exponentially growing yeast culture were inoculated into fresh medium and, after some initial period, were harvested at seven 2-hour intervals. Their mRNA was isolated, and fluorescently labeled cDNA was prepared. Two different fluorescent labels were used: one for the cells harvested at each of the successive time-points, the other for the cells harvested at the first time-point (reference measurement). The cDNA from each time-point, together with the reference cDNA, was hybridized to the microarray with approximately 6400 DNA sequences representing ORFs of the yeast genome. Measurements of the relative fluorescence intensity for each element reflect the relative abundance of the corresponding mRNA.

Visualizing the data (expression profile of the “first” 250 genes)

Average expression level of genes at the respective time-points

Three approaches
• Finding correlations between gene expression profiles and their functional classes
• Building decision trees for predicting gene functional classes from their expression data
• In silico discovery of putative transcription factor binding sites in the regions upstream of the genes with similar expression profiles (to appear in Genome Research, Dec. 1998)

Gene distribution across the functional classes

Energy gene subclasses in the yeast (less frequent merged in one)

Gene expression for energy genes during the diauxic shift at the seven time-points

Expression profiles of respiration genes

Expression profiles of fermentation genes

Average expression levels at the 7 time-points and for energy class genes during diauxic shift

Average expression levels at all time-points and for all energy classes

Energy classes distribution

Decision tree for respiration genes

Decision tree for fermentation

Tricarboxylic acid, respiration and reserves decision tree

Clustering the gene expression profiles by discretization of the gene expression measurement space
[Figure: logarithm of expression ratio plotted against time points, with the corresponding discrete pattern]
Put the genes mapping to the same discrete pattern in a cluster
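The discretization step can be sketched as follows. This is a minimal illustration, not the scheme used in the talk: the three-level coding (-1, 0, +1), the log2 threshold of 1.0, and the gene names and ratios are all assumptions made for the example.

```python
import math
from collections import defaultdict

def discretize(ratios, threshold=1.0):
    """Map each (positive) expression ratio to -1, 0 or +1 by thresholding its log2 value.

    The three-level coding and the default threshold are assumptions for illustration.
    """
    pattern = []
    for r in ratios:
        log_r = math.log2(r)
        if log_r >= threshold:
            pattern.append(1)       # clearly increased relative to the reference
        elif log_r <= -threshold:
            pattern.append(-1)      # clearly decreased relative to the reference
        else:
            pattern.append(0)       # no clear change
    return tuple(pattern)

def cluster_by_pattern(profiles, threshold=1.0):
    """Group genes whose profiles map to the same discrete pattern."""
    clusters = defaultdict(list)
    for gene, ratios in profiles.items():
        clusters[discretize(ratios, threshold)].append(gene)
    return dict(clusters)

# Hypothetical expression ratios at seven time-points.
profiles = {
    "geneA": [1.0, 1.1, 0.9, 1.2, 1.0, 2.5, 4.0],
    "geneB": [0.9, 1.0, 1.1, 1.0, 1.3, 3.0, 5.2],
    "geneC": [1.0, 0.9, 1.0, 0.8, 0.7, 0.4, 0.2],
}
clusters = cluster_by_pattern(profiles)
for pattern, genes in clusters.items():
    print(pattern, genes)
```

Changing `threshold` changes which genes share a pattern, so each threshold yields a different clustering system.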

Organization of a typical yeast promoter
[Figure: promoter diagram showing URS, TATA, I, the coding region and the RNA; distances in bp not recovered]

In silico discovery of transcription factor binding sites from expression data
Take data from gene expression level measurements (from DNA array technologies)
-> Cluster together genes with similar expression profiles
-> Take sequences upstream from the genes in each cluster
-> Look for sequence patterns overrepresented in a cluster

Clustering genes by similar expression profiles
• Put in each cluster all genes that map to the same discrete pattern
• Different thresholds give different clustering systems
• We obtained 32 different clusters containing from 10 to 77 genes and 11 clusters containing at least 25 genes

Hypothesis to test
Genes with similar expression profiles may be regulated by similar expression mechanisms and thus may contain similar transcription factor binding sites

Discovering regulatory elements in gene upstream sequences
• Take the sequences of a certain length (e.g., 300 bp) upstream of all genes with a certain expression profile
• Look for a priori unknown sequence patterns that are over-represented in these regions (taking into account the other upstream regions as background)
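A simple way to picture the search for over-represented patterns is to count, for every fixed-length substring, how many cluster sequences contain it and compare that with the background. The sketch below does this by brute-force k-mer enumeration; it is not the suffix-tree algorithm used in this work, and the function names, the add-one smoothing, the `min_fold` cutoff and the toy sequences are assumptions for illustration.

```python
from collections import Counter

def sequences_containing(seqs, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts

def overrepresented(cluster, background, k, min_fold=2.0):
    """Return (k-mer, cluster count, background count) for k-mers enriched in the cluster."""
    in_cluster = sequences_containing(cluster, k)
    in_background = sequences_containing(background, k)
    hits = []
    for kmer, n_pos in in_cluster.items():
        frac_pos = n_pos / len(cluster)
        # Add-one smoothing (an assumption) avoids division by zero
        # for k-mers that never occur in the background.
        frac_neg = (in_background[kmer] + 1) / (len(background) + 1)
        if frac_pos / frac_neg >= min_fold:
            hits.append((kmer, n_pos, in_background[kmer]))
    return sorted(hits, key=lambda h: (-h[1], h[0]))

# Hypothetical short "upstream regions"; real ones would be e.g. 300 bp long.
cluster = ["TTCCCCTAA", "GACCCCTGG", "ACCCCTTTA"]
background = ["ATATATATA", "GGTTGGTTG", "CCCCTATAT", "TAGTAGTAG"]
hits = overrepresented(cluster, background, k=5)
print(hits)
```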

Pattern discovery in biosequences
• Group together sequences thought to have common biological (structural, functional) properties, ignoring the purely sequence (syntactic) properties
• Study the purely syntactic properties of these sequences, ignoring their biological (semantic) properties

Problem of “noise”
• Gene expression measurement accuracy is about a factor of 2 (in 95% of cases)
• Clusters are very dependent on the clustering method and thresholds
• The same expression profile does not necessarily mean the same regulation mechanism

Dealing with noise
• One cannot look for patterns common to the whole set of strings, only for patterns overrepresented in the set
• Looking for sets of patterns covering the set
• Use of “negative” or background sequences

More powerful algorithms than those currently existing are needed
• We used such a new, more powerful algorithm, based on a suffix-tree representation of the sequence space (implemented by Jaak Vilo at Helsinki University)
• We looked systematically for all patterns discriminating the upstream regions in the clusters from randomly selected upstream regions

Use of negative sequences
Looking for patterns that are overrepresented in the sequences upstream from genes in a cluster in comparison to all other upstream sequences

The rating function
• Given two sets S+ and S- and a pattern P, return rating R(S+, S-, P)
• Two rating functions that we used:
  » ratio: number of sequences in S+ matching P divided by number of sequences in S- matching P
  » probability that the pattern can occur in S+ “by chance”, assuming that the occurrences in S- are “by chance” and using the binomial distribution
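Both rating functions can be sketched in a few lines. The ratio follows the definition above directly. For the binomial rating the slide does not give the exact formula, so the version here is one plausible reading and should be treated as an assumption: the per-sequence match probability p is estimated as the fraction of S- sequences matching P, and the rating is the probability of seeing at least the observed number of matches among the S+ sequences. Function names and toy sequences are hypothetical.

```python
import re
from math import comb

def count_matching(pattern, seqs):
    """Number of sequences containing at least one match of the pattern."""
    regex = re.compile(pattern)
    return sum(1 for s in seqs if regex.search(s))

def ratio_rating(pattern, positives, negatives):
    """Matches in S+ divided by matches in S-."""
    n_pos = count_matching(pattern, positives)
    n_neg = count_matching(pattern, negatives)
    if n_neg == 0:
        # Convention chosen for this sketch: infinite rating if only S+ matches.
        return float("inf") if n_pos else 0.0
    return n_pos / n_neg

def binomial_rating(pattern, positives, negatives):
    """P(at least n_pos matches in |S+| trials), with p estimated from S- (assumed reading)."""
    n_pos = count_matching(pattern, positives)
    p = count_matching(pattern, negatives) / len(negatives)
    n = len(positives)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n_pos, n + 1))

# Hypothetical toy sets.
positives = ["AACCCCTG", "TCCCCTAA", "GGCCCCTT", "ATATATAT"]
negatives = ["ATATATAT", "GGGTTTAA", "CCCCTAAA", "TTTTGGGG",
             "ACGTACGT", "TGCATGCA", "AAAATTTT", "GGCCGGCC"]
print(ratio_rating("CCCCT", positives, negatives))
print(binomial_rating("CCCCT", positives, negatives))
```

A small binomial rating means the pattern is unlikely to be this frequent in S+ by chance. Note that the raw ratio compares counts, not fractions, so it depends on the relative sizes of S+ and S-.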

The sequence pattern discovery experiment
• We ran the algorithm on upstream sequences (length 2 * 300) of all the 32 gene clusters
• Each cluster produced hundreds of overrepresented patterns
• The problem of validation

Some discovered sequence patterns from clusters of upstream sequences
• Clusters with an increase in the expression level after time-point 6:
  » CCCCT - known to be a stress responsive motif
• Clusters with a decrease in the expression level after time-point 6:
  » ATCC..T..A - RAP1 protein
  » ATC..TAC - RAP1, REB1, BAF1
  » ATTTCA...T - GA-BF protein
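In patterns such as ATCC..T..A, each dot stands for a position where any nucleotide is allowed. That happens to coincide with regular-expression syntax, so locating such a pattern in an upstream sequence can be sketched with Python's `re` module. The helper name and the example sequence are hypothetical.

```python
import re

def find_sites(pattern, sequence):
    """Return 0-based start positions of all matches, including overlapping ones.

    A dot in the pattern matches any single nucleotide.
    """
    # The lookahead makes the regex engine report overlapping matches too.
    regex = re.compile(f"(?=({pattern}))")
    return [m.start() for m in regex.finditer(sequence)]

# Hypothetical upstream sequence containing two of the patterns.
upstream = "GGATCCGATAAACCCCTTT"
for pattern in ["CCCCT", "ATCC..T..A", "ATC..TAC"]:
    print(pattern, find_sites(pattern, upstream))
```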

Statistical validation of the discovered patterns
• For each cluster, choose a random set of upstream regions of the same size
• Run the pattern discovery algorithm on the random set of regions in addition to the cluster
• Compare the scores of the discovered patterns from the cluster and from the random set
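The validation procedure amounts to a randomization control, which can be sketched as below. The scoring function here (`top_kmer_count`, the largest number of sequences sharing one 5-mer) is a deliberately simple stand-in for the real pattern discovery score, and the sequences, seeds and number of trials are assumptions for illustration.

```python
import random
from collections import Counter

def top_kmer_count(seqs, k=5):
    """Largest number of sequences that share any single k-mer (toy score)."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return max(counts.values(), default=0)

def random_control_scores(pool, set_size, n_trials=100, seed=0, score=top_kmer_count):
    """Score n_trials random sets of upstream regions, each of size set_size."""
    rng = random.Random(seed)
    return [score(rng.sample(pool, set_size)) for _ in range(n_trials)]

# Toy data: five "cluster" regions sharing CCCCT, plus 50 random background regions.
cluster = [
    "ACGTACCCCTGATTA",
    "TTGACCCCCTAGGTC",
    "GATTGCCCCTTACAG",
    "CAGTTCCCCTGTGAA",
    "TGCAACCCCTCATGG",
]
gen = random.Random(1)
background = ["".join(gen.choice("ACGT") for _ in range(30)) for _ in range(50)]

cluster_score = top_kmer_count(cluster)
control = random_control_scores(cluster + background, len(cluster))
# Fraction of random sets that score at least as well as the real cluster.
empirical_p = sum(s >= cluster_score for s in control) / len(control)
print(cluster_score, max(control), empirical_p)
```

If random sets of the same size routinely reach the cluster's score, the patterns found in the cluster are not convincing.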

Conclusions
• The discovered patterns are in accordance with the existing knowledge
• Transcription factor binding sites can be discovered in silico from gene expression data
• More refined and validated gene expression measurements are needed

Acknowledgements
• Inge Jonassen (Bergen)
• Jaak Vilo, Esko Ukkonen (Helsinki)
• Alistair Ewing, Neil Skilling (Quadstone Ltd - developers of Decisionhouse data mining software)
• BIOVIS and BIOSTANDARDS projects from the EU at EBI