Gene expression analysis

Presentation transcript:

Tutorial 7: Gene expression analysis

Gene expression analysis
How to interpret an expression matrix
Expression data DBs - GEO
General clustering methods
  Unsupervised clustering
    Hierarchical clustering
    K-means clustering
Tools for clustering - EPCLUST
Functional analysis - GO annotation

Gene expression data sources Microarrays RNA-seq experiments

How to interpret an expression data matrix
Columns: Exp1/Sample 1, Exp2/Sample 2, Exp3/Sample 3, Exp4/Sample 4, Exp5/Sample 5, Exp6/Sample 6
Gene 1: -1.2, -2.1, -3, -1.5, 1.8, 2.9
Gene 2: 2.7, 0.2, -1.1, 1.6, -2.2, -1.7
Gene 3: -2.5, 1.5, -0.1, -1, 0.1
Gene 4: 2.6, 2.5, -2.3
Gene 5: 2.2
Gene 6: -2.9, -1.9, -2.4
Each column represents all the gene expression levels from a single experiment (two-color array) or from a single sample (one-color array).
Each row represents the expression of one gene across all experiments.
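The row and column reading of the matrix can be sketched in a few lines of Python. This is only an illustration: it uses the two rows of the slide's matrix that survive in full, and the helper names `gene_profile` and `sample_column` are made up for this example.

```python
# Rows are genes, columns are experiments/samples.
# Values are the two complete rows (Gene 1, Gene 2) from the slide's matrix.
samples = ["Exp1", "Exp2", "Exp3", "Exp4", "Exp5", "Exp6"]
matrix = {
    "Gene 1": [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9],
    "Gene 2": [2.7, 0.2, -1.1, 1.6, -2.2, -1.7],
}


def gene_profile(matrix, gene):
    """Return one row: the expression of a gene across all experiments."""
    return matrix[gene]


def sample_column(matrix, samples, sample):
    """Return one column: every gene's expression level in one experiment."""
    j = samples.index(sample)
    return {gene: row[j] for gene, row in matrix.items()}


print(gene_profile(matrix, "Gene 1"))
print(sample_column(matrix, samples, "Exp5"))
```

Clustering genes compares rows with each other; clustering samples compares columns.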

How to interpret an expression data matrix
Each element is a log ratio.
In a two-color array: log2(T/R), where T is the gene expression level in the test sample and R is the gene expression level in the reference sample.
In a one-color array: log2(X), where X is the gene expression level in the current sample.
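As a small worked example of the two formulas, the sketch below computes log2(T/R) and log2(X). The intensity values are invented for illustration and the function names are not from the tutorial.

```python
import math


def two_color_log_ratio(test, reference):
    """log2(T/R): positive when T > R, negative when T < R, zero when T == R."""
    return math.log2(test / reference)


def one_color_log(x):
    """log2(X) for a single-channel intensity X."""
    return math.log2(x)


# Hypothetical intensities: a 4-fold increase and a 4-fold decrease
# give log ratios of equal size and opposite sign.
print(two_color_log_ratio(800, 200))
print(two_color_log_ratio(200, 800))
print(one_color_log(1024))
```

The log scale makes up- and down-regulation symmetric around zero, which is why the matrix holds log ratios rather than raw ratios.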

How to interpret an expression data matrix: color scale
In a two-color array: red indicates a positive log ratio (T>R), black indicates a log ratio of about zero (T≈R), and green indicates a negative log ratio (T<R).
In a one-color array: bright green indicates a high expression value and black indicates no expression.

Microarray data: different representations
[Figure: log ratio plotted against experiment (Exp) in two representations, with the T>R and T<R regions marked.]

How to analyze gene expression data

Expression profile DBs
GEO (Gene Expression Omnibus): http://www.ncbi.nlm.nih.gov/geo/
Human genome browser: http://genome.ucsc.edu/
ArrayExpress: http://www.ebi.ac.uk/arrayexpress/

The current rate of submission and processing is over 10,000 samples per month. In 2002, the Nature journals announced a requirement that microarray data be deposited in public databases.

Searching for expression profiles in GEO: http://www.ncbi.nlm.nih.gov/geo/
* Further curated = statistically comparable datasets

GEO accession IDs
GPL**** - platform ID
GSM**** - sample ID
GSE**** - series ID
GDS**** - dataset ID
A Series record defines a set of related Samples considered to be part of a group. A GDS record represents a collection of biologically and statistically comparable GEO Samples. Not every experiment has a GDS.
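The four prefixes are enough to tell what kind of record an accession points to. The sketch below assumes an accession is one of these prefixes followed by digits. The function name `geo_record_type` and the accession numbers are arbitrary illustrations, not taken from the tutorial.

```python
import re

# Prefix-to-record-type mapping, as listed on the slide.
GEO_RECORD_TYPES = {
    "GPL": "platform",
    "GSM": "sample",
    "GSE": "series",
    "GDS": "dataset",
}

# Assumption: an accession is a known prefix followed by one or more digits.
_ACCESSION = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def geo_record_type(accession):
    """Return the GEO record type for an accession such as 'GSE1234'."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised GEO accession: {accession!r}")
    return GEO_RECORD_TYPES[match.group(1)]


# Arbitrary example accessions, used only to show the prefix logic.
for acc in ("GPL96", "GSM1000", "GSE1234", "GDS507"):
    print(acc, "->", geo_record_type(acc))
```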

Clustering, statistical analysis, download dataset

Clustering analysis

Clustering analysis – zoom in


Viewing the expression levels


Clustering: grouping together “similar” genes

Clustering
Unsupervised learning: the classes are unknown a priori and need to be “discovered” from the data.
Supervised learning: the classes are predefined and the task is to understand the basis for the classification from a set of labeled objects. This information is then used to classify future observations.
http://www.bioconductor.org/help/course-materials/2002/Seattle02/Cluster/cluster.pdf

Unsupervised clustering
Hierarchical methods: these methods provide a hierarchy of clusterings, from the smallest set of clusters, where all objects are in one cluster, through to the largest set, where each observation is in its own cluster.
Partitioning methods: these usually require the number of clusters to be specified in advance. A mechanism for apportioning objects to clusters must then be determined.
http://www.bioconductor.org/help/course-materials/2002/Seattle02/Cluster/cluster.pdf

Hierarchical Clustering This clustering method is based on distances between expression profiles of different genes. Genes with similar expression patterns are grouped together.

Rings a bell?... In both phylogenetic trees and clustering, we create a tree based on a distance matrix. When computing phylogenetic trees, we compute distances between sequences (for example, ATCTGTCCGCTCG vs. ATGTGTGCGCTTG). When computing clustering dendrograms, we compute distances between expression values (for example, Gene 1 vs. Gene 2 across Expr.1 to Expr.6).

How to determine the similarity between two genes? Patrik D'haeseleer, How does gene expression clustering work?, Nature Biotechnology 23, 1499 - 1501 (2005) , http://www.nature.com/nbt/journal/v23/n12/full/nbt1205-1499.html
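The slide leaves the answer to its figure. Two measures commonly used for expression profiles are Euclidean distance and Pearson correlation, and the sketch below shows both. The two gene profiles are invented, and choosing these two measures is an assumption of this example, not something the slide's text states.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance: sensitive to the magnitude of expression changes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Correlation of two profiles: compares their shape, ignoring scale and offset."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_distance(a, b):
    """Turn correlation into a distance: 0 for identical shape, 2 for opposite shape."""
    return 1 - pearson_correlation(a, b)


# Hypothetical profiles: gene_b has the same shape as gene_a at twice the amplitude.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]

print(euclidean_distance(gene_a, gene_b))
print(correlation_distance(gene_a, gene_b))
```

The two measures can disagree: these profiles are far apart in Euclidean terms but perfectly correlated, so the choice of metric changes which genes count as "similar".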

Hierarchical clustering methods produce a tree, or dendrogram. They avoid specifying how many clusters are appropriate by providing a partition for each K. The partitions are obtained by cutting the tree at different levels, for example into 2, 4, or 6 clusters.
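A minimal sketch of bottom-up (agglomerative) clustering, assuming Euclidean distance and average linkage. Neither choice comes from the slides, and the five gene profiles are invented. Stopping the merging when K clusters remain gives the same partition as cutting the finished tree at the level that has K branches.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(profiles, k):
    """Start with one cluster per gene and merge the closest pair until k remain."""
    clusters = [[name] for name in profiles]
    while len(clusters) > k:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical expression profiles (three experiments per gene).
profiles = {
    "g1": [2.0, 2.1, -1.0],
    "g2": [2.2, 1.9, -1.1],
    "g3": [-2.0, -1.8, 1.0],
    "g4": [-2.1, -2.0, 1.2],
    "g5": [1.0, 1.0, -0.5],
}

for k in (3, 2):
    print(k, "clusters:", agglomerate(profiles, k))
```

Asking for a different K does not require a new clustering run in principle, since every partition is already encoded in the same tree.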

The more clusters you request, the higher the similarity within each cluster. http://discoveryexhibition.org/pmwiki.php/Entries/Seo2009

Hierarchical clustering results http://www.spandidos-publications.com/10.3892/ijo.2012.1644

Unsupervised clustering – K-means clustering: an algorithm that partitions the data into K groups (K=4 in the example shown).

How does it work?
1. k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).
2. k clusters are created by associating every observation with the nearest mean.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.
The algorithm iteratively divides the genes into K groups and calculates the center of each group. The result is a locally optimal grouping for the chosen K, in which each gene ends up closest to its own cluster center. It can depend on the random starting means.
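The four steps can be sketched in plain Python as follows. This is an illustrative implementation, not the one behind the slide's figures: the squared Euclidean distance, the fixed random seed, and the toy profiles are all assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k initial means at random from the data set.
    means = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: associate every observation with its nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, means[c]))
            for p in points
        ]
        # Step 4: stop once the assignment no longer changes (convergence).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: the centroid of each cluster becomes its new mean.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old mean if a cluster ends up empty
                means[c] = centroid(members)
    return assignment, means


# Hypothetical two-sample profiles forming two well-separated groups.
points = [
    [1.0, 1.0], [1.2, 1.2], [0.8, 0.8],
    [5.0, 5.0], [5.2, 5.2], [4.8, 4.8],
]
labels, means = k_means(points, k=2)
print(labels)
print(means)
```

On less cleanly separated data, rerunning with different seeds and comparing the partitions is a simple way to see how much the result depends on the starting means.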

How should we determine K?
Trial and error
Take K as the square root of the number of genes
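The square-root rule is a one-line calculation. The sketch below rounds to the nearest integer, which is an assumption since the slide does not say how to round. The result is best treated as a starting point for the trial-and-error step, not as a final answer.

```python
import math


def rule_of_thumb_k(n_genes):
    """Square-root heuristic from the slide: K is about sqrt(number of genes)."""
    if n_genes < 1:
        raise ValueError("need at least one gene")
    return max(1, round(math.sqrt(n_genes)))


# Hypothetical gene counts.
for n in (100, 2500, 10000):
    print(n, "genes -> K =", rule_of_thumb_k(n))
```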

Tool for clustering - EPclust http://www.bioinf.ebc.ee/EP/EP/EPCLUST/

Choose distance metric Choose algorithm

Hierarchical clustering

Zoom in by clicking on the nodes

K-means clustering

Samples found in cluster; graphical representation of the cluster

10 clusters, as requested

Now what? Now that we have clusters, we want to know the function of each group. This requires some kind of generalization of gene functions.

Gene Ontology (GO) http://www.geneontology.org/ The Gene Ontology project provides an ontology of defined terms representing gene product properties. The ontology covers three domains:

Gene Ontology (GO)
Cellular Component (CC): the parts of a cell or its extracellular environment.
Molecular Function (MF): the elemental activities of a gene product at the molecular level, such as binding or catalysis.
Biological Process (BP): operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.

The GO tree

GO sources
ISS - Inferred from Sequence/Structural Similarity
IDA - Inferred from Direct Assay
IPI - Inferred from Physical Interaction
TAS - Traceable Author Statement
NAS - Non-traceable Author Statement
IMP - Inferred from Mutant Phenotype
IGI - Inferred from Genetic Interaction
IEP - Inferred from Expression Pattern
IC - Inferred by Curator
ND - No Data available
IEA - Inferred from Electronic Annotation

DAVID: http://david.abcc.ncifcrf.gov/
Functional Annotation Bioinformatics Microarray Analysis
Identify enriched biological themes, particularly GO terms
Discover enriched functional-related gene/protein groups
Cluster redundant annotation terms
Explore gene names in batch

Annotation, classification, ID conversion

Functional annotation
Upload
Charts for each category
Genes from your list involved in this category

Minimum number of genes for corresponding term
Maximum EASE score / E-value
Enriched terms associated with your genes
Source of term
Genes from your list involved in this category
E-value
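To give a sense of what an enrichment score measures, the sketch below computes a one-sided hypergeometric (Fisher-type) p-value for a single term: the probability of seeing at least the observed number of annotated genes in a list of this size by chance. DAVID describes its EASE score as a modified, more conservative Fisher exact p-value, and this sketch does not reproduce that modification. The function name and the counts are hypothetical.

```python
from math import comb


def enrichment_p_value(list_hits, list_size, pop_hits, pop_size):
    """P(X >= list_hits) when list_size genes are drawn without replacement
    from a population of pop_size genes, pop_hits of which carry the term."""
    upper = min(list_size, pop_hits)
    total = comb(pop_size, list_size)
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, list_size - k)
        for k in range(list_hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 3 of the 300 genes in our list carry a term that
# annotates 40 of the 30000 genes in the background population.
p = enrichment_p_value(list_hits=3, list_size=300, pop_hits=40, pop_size=30000)
print(p)
```

A small p-value means the term is over-represented in the list relative to the background. Since many terms are tested at once, the values are usually read together with a multiple-testing correction.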

A group of terms having similar biological meaning due to sharing similar gene members
