Expression Profile Clustering
Expression Profile = the pattern of signal values for one gene over several chips.
Expression Profile Clustering = the clustering of similar profiles.
Why?
–Similar expression profiles suggest shared regulation (by a shared factor or by another cluster member) and related function
–ALSO: opposite expression profiles suggest regulation (e.g. inhibition)
Software:
–EpClust
–J-Express
–Etc.
Data file format
Data can be entered in a variety of formats.
–See EpClust's data upload page
One simple way:
–Download tab-delimited data from NASC
–Open with (or paste into) Excel
–Delete all but one name column and the Signal value columns.
–Do not include any hyphens!
–(I try to avoid all punctuation)
–Save as tab-delimited text
Annotation can be added as either:
–A second file with the same first column of names
–A single second column in the signal file (if specified)
Example:
Gene_ID	Chip1	Chip2	Chip3
AT1G01010	404865
AT1G01030	241
AT1G01040	798835
AT1G01050	713744671
AT1G01060	5680111
AT1G01070	7510994
AT1G01080	3933
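As a rough illustration of the file layout described above, the sketch below reads tab-delimited text with one name column followed by signal columns. The function name `read_profiles` and the example gene names and values are made up for illustration; this is not EpClust's own parser.

```python
import csv
import io

def read_profiles(handle):
    """Read tab-delimited text: a header row, then one gene per row."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    chips = header[1:]              # every column after the name column
    profiles = {}
    for row in reader:
        if not row:                 # skip blank lines
            continue
        name, values = row[0], row[1:]
        if "-" in name:             # the slide warns against hyphens
            raise ValueError(f"hyphen in gene name: {name!r}")
        profiles[name] = [float(v) for v in values]
    return chips, profiles

# Made-up example data (hypothetical names and values).
example = "Gene_ID\tChip1\tChip2\tChip3\nGENEA\t10\t20\t30\nGENEB\t5\t5\t50\n"
chips, profiles = read_profiles(io.StringIO(example))
print(chips)
print(profiles)
```

The same function accepts an open file, e.g. `read_profiles(open("signals.txt", newline=""))`, where the file name is again only a placeholder.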
To Choose Algorithm Type
Hierarchical
–Measures the distance between each profile (i.e. gene) and every other profile
–So computing time rises steeply as genes are added (roughly with the square of the number of genes, since every pair is compared)
–Then clusters the closest genes together, followed by increasingly distant ones
–The result is a tree of clusters within clusters, which can be visualised and finally split at a chosen distance with knowledge of the tree
K-means
–Initially choose a set number (K) of clusters
–1) Chooses the K most different profiles
–2) Clusters each remaining profile with one of the K
–So, much less time than Hierarchical (if many genes)
–Cluster size can vary, so a cluster can return too many or too few genes, with no way to select among them. You must then repeat with different K values.
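The two approaches can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not EpClust's implementation: the names `hierarchical`, `kmeans`, `cut` and `rounds` are hypothetical, Euclidean distance and average linkage are assumed, "the K most different profiles" is read as farthest-first seeding, and the repeated re-assignment loop is the usual k-means refinement rather than something the slide states.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical(profiles, cut):
    """Naive average-linkage clustering, stopped at distance `cut`."""
    names = list(profiles)
    # All-against-all distances: n*(n-1)/2 comparisons.
    dist = {(a, b): euclidean(profiles[a], profiles[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

    def d(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    def linkage(c1, c2):
        return sum(d(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [[n] for n in names]
    while len(clusters) > 1:
        best, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                         for i in range(len(clusters))
                         for j in range(i + 1, len(clusters)))
        if best > cut:              # remaining clusters are too far apart
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]             # j > i, so index i is unaffected
    return clusters

def kmeans(profiles, k, rounds=10):
    """K-means with farthest-first seeding (assumes k <= number of genes)."""
    names = list(profiles)
    centres = [profiles[names[0]]]
    while len(centres) < k:
        far = max(names, key=lambda n: min(euclidean(profiles[n], c)
                                           for c in centres))
        centres.append(profiles[far])
    for _ in range(rounds):
        groups = [[] for _ in centres]
        for n in names:
            nearest = min(range(k),
                          key=lambda i: euclidean(profiles[n], centres[i]))
            groups[nearest].append(n)
        # New centre = mean profile of the group; keep the old one if empty.
        centres = [[sum(col) / len(g) for col in zip(*(profiles[n] for n in g))]
                   if g else c for g, c in zip(groups, centres)]
    return groups

# Made-up profiles: two obvious groups.
demo = {"g1": [1, 2, 3], "g2": [1, 2, 4], "g3": [10, 10, 10], "g4": [11, 10, 9]}
print(hierarchical(demo, cut=3))
print(kmeans(demo, k=2))
```

Note the difference in what has to be chosen: `hierarchical` needs a cut distance, which can be picked after looking at the tree, whereas `kmeans` needs K up front.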
So Which is Better?
K-Means is better able to manage large data-sets.
Hierarchical seems a more objective approach
–In that you don't need to decide the cluster number at the start
But which is more biologically informative?
Opinion divided. Both artificial. Don't prove, just suggest.
K-means gives more consistent results.
...and then there's SOTA too!
Pearson-Based Distance Measurement
The most commonly used
How similar the SHAPES of the two profiles are
Based on the average of values and the standard deviation
Rates from identical (1) to completely uncorrelated (0) to perfect opposites (-1)
Centred Test?
–If profiles have identical shape, but are offset from each other by a fixed value (or magnitude)
–Centred: Identical (1)
–Uncentred: Not (<1)
Absolute Test?
–If profiles have perfect opposite expression patterns
–Absolute: Identical (1)
–Non-Absolute: perfect opposites (-1)
Parametric Test?
–Parametric: assumes normal distribution. More rigorous where there are no outliers.
–Non-parametric (= Spearman rank): More rigorous where there are outliers.
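The centred, uncentred, absolute and rank-based variants can be compared on small made-up profiles. The sketch below is one plain-Python way to compute them; the function names are hypothetical and the code is not taken from EpClust. It assumes each profile has some variation (a flat profile would give a division by zero).

```python
import math

def pearson(x, y, centred=True):
    """Correlation of two profiles; uncentred skips subtracting the means."""
    if centred:
        mx, my = sum(x) / len(x), sum(y) / len(y)
        x = [v - mx for v in x]
        y = [v - my for v in y]
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Non-parametric version: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up profiles.
a = [1, 2, 3, 4]
offset = [11, 12, 13, 14]      # same shape as a, shifted by a fixed value
opposite = [4, 3, 2, 1]        # mirror image of a

print(pearson(a, offset))                  # centred
print(pearson(a, offset, centred=False))   # uncentred
print(pearson(a, opposite))                # non-absolute
print(abs(pearson(a, opposite)))           # absolute
print(spearman([1, 2, 3, 4, 100], [2, 4, 6, 8, 9]))  # one outlier
```

A distance for clustering is then commonly taken as one minus the chosen correlation, so that identical shapes sit at distance 0.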
Euclidean-Based Distance Measurement
Measures the distance between gene expression levels directly
Based on the magnitude of changes
More about signal VALUES. Less about profile shapes.
Data must be suitably normalized
–e.g. use log-ratios of signals
Euclidean: shortest path between points
Manhattan: the sum of distances along each dimension
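For comparison, the two value-based distances and a log-ratio normalisation can be written as follows. This is a small sketch with hypothetical function names; taking log base 2 against a reference signal is an assumption about what "log-ratios" means here.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences along each chip (dimension)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def log_ratios(signals, reference):
    """log2(signal / reference) per chip; all values must be positive."""
    return [math.log2(s / r) for s, r in zip(signals, reference)]

# Made-up two-chip profiles.
print(euclidean([0, 0], [3, 4]))
print(manhattan([0, 0], [3, 4]))
print(log_ratios([8, 2], [2, 2]))
```

For the pair above the Euclidean distance is 5 and the Manhattan distance is 7, which shows why the two measures can rank neighbours differently.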
Extra Options
Options shown: Phylip Format; Search for Promoter Motifs; Format Results Picture; Export Lists as Text; Statistics
Export clusters as:
–Text (lists of gene names), from the tree cut at a chosen height
–Or formatted for Phylip (tree drawing software)
Search for promoter motifs:
–Visualise the promoter region next to each profile
–Highlight motifs
–You must tell it the sequences of the motifs, though.
  Motif sequences can be found by querying a database such as PlantCARE with the upstream sequence of our particular gene of most interest
–You must prepare a file of upstream sequences
  Currently you must arrange with EpClust staff to upload it
  Upstream sequences can be obtained from TAIR using a list of gene names
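Highlighting user-supplied motifs in an upstream sequence amounts to a simple string search. The sketch below is only an illustration of the idea: the function name `highlight`, the sequence and the motif are made up, only the forward strand is searched, and it is not how EpClust draws its pictures.

```python
def highlight(sequence, motifs):
    """Return the sequence in lower case with motif matches in upper case."""
    seq = sequence.lower()
    marked = [False] * len(seq)
    for motif in motifs:
        m = motif.lower()
        start = seq.find(m)
        while start != -1:
            for i in range(start, start + len(m)):
                marked[i] = True
            start = seq.find(m, start + 1)   # also catches overlapping hits
    return "".join(c.upper() if hit else c for c, hit in zip(seq, marked))

# Made-up upstream sequence and motif.
print(highlight("ttacgtggcaaacgtg", ["ACGTG"]))
```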
PlantCARE: Cis-Acting Regulatory Elements
http://intra.psb.ugent.be:8080/PlantCARE/
Database of CAREs
Tools
–Search for CARE
  Enter the upstream sequence of a gene
  Lists and then highlights known motifs from the database
–Motif Sampler
  Enter the upstream sequences of your cluster
  Highlights 8-mers conserved more in cluster members than in the genome
–Other tools
  Clustering
  Query for info on motifs
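The idea of an 8-mer being "conserved more in cluster members than in the genome" can be illustrated with naive counting. The sketch below is a stand-in for intuition only and is not the method PlantCARE's Motif Sampler uses; the function names, the `min_ratio` threshold and the pseudo-frequency floor are all assumptions, and the demo uses 4-mers on made-up sequences to stay short.

```python
def kmers(seq, k=8):
    """Set of distinct k-mers in one sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def fraction_with(sequences, k=8):
    """For each k-mer, the fraction of sequences that contain it."""
    counts = {}
    for s in sequences:
        for km in kmers(s, k):
            counts[km] = counts.get(km, 0) + 1
    return {km: n / len(sequences) for km, n in counts.items()}

def enriched(cluster, background, k=8, min_ratio=2.0):
    """k-mers at least `min_ratio` times more common in cluster sequences."""
    in_cluster = fraction_with(cluster, k)
    in_background = fraction_with(background, k)
    floor = 1 / (len(background) + 1)   # stands in for k-mers never seen
    hits = [(km, f, in_background.get(km, 0.0))
            for km, f in in_cluster.items()
            if f / max(in_background.get(km, 0.0), floor) >= min_ratio]
    # Most widespread in the cluster first, then alphabetical.
    return sorted(hits, key=lambda h: (-h[1], h[0]))

# Made-up sequences; k=4 keeps the example small.
cluster_seqs = ["AAACGTAA", "CCACGTCC"]
background_seqs = ["AAAAAAAA", "CCCCCCCC", "GGGGGGGG", "TTTTTTTT"]
print(enriched(cluster_seqs, background_seqs, k=4, min_ratio=3.0))
```

In the demo only `ACGT` is present in every cluster sequence and absent from the background, so it is the only 4-mer reported at that threshold.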
Gene Ontology
A hierarchical structure to describe gene function.
Just as PlantCARE compares expression profile clusters with promoter motif conservation, there are also tools that compare expression profile clusters with Gene Ontology terms.