The MORPH Algorithm MORPH = MOdule guided Ranking of candidate PatHway genes high throughput data Slides: Rachel E. Bell, June 2013.

1 The MORPH Algorithm MORPH = MOdule guided Ranking of candidate PatHway genes high throughput data Slides: Rachel E. Bell, June 2013

2 Motivation. Challenges in studying biological pathways: identifying missing pathway members; information gaps on participating genes, e.g. (a) the nature of interactions between metabolites and gene expression, (b) understanding control mechanisms, feedback and cross-talk; many genes in the genome(s) have unknown function.

3 Biological Pathways: Overview. What is a pathway? A series of interactions between genes (proteins) involved in performing a certain biological function. Cell input = extracellular/endogenous signals, e.g. stress, changes in pH, UV exposure, nutrients. Cell output = response, e.g. transcription of genes, sucrose degradation.

4 MORPH Algorithm: Overview. INPUT: high throughput data of gene expression, networks and biological pathways. ALGORITHM: machine learning and validation methods. OUTPUT: predicted genes involved in biological pathways.

5 Other methods for functional prediction. Coexpression-based methods (& possibly pathways), e.g. ACT, GeneCAT, ATTED-II, MapMan. Assumptions: 1) similar expression patterns -> similar function or regulation; 2) pathway genes -> coordinated expression. Network-based methods (& gene expression), e.g. Markov random field (MRF) models, k-nearest neighbours (k-NN), ADOMETA (coexpression, phylogeny, clustering on chromosomes, metabolic networks). Assumption: closer nodes -> common functions.

6 Introduction: MORPH Algorithm. MORPH uses pathway information, gene expression data and network information. Compared to other methods, MORPH: offers robustness (performs well on many pathways); increases network coverage; can be applied to different organisms.

7 Talk outline 1. MORPH input types: (a) gene expression data, (b) pathways and (c) networks 2. Types of clustering (modules) methods 3. The MORPH algorithm and validation 4. Results 5. Comparison to other methods 6. Summary

8 MORPH Introduction. MORPH was developed on 2 model organisms: Arabidopsis thaliana and Solanum lycopersicum (tomato).

9 MORPH Input: Arabidopsis thaliana. Pathways: 66 AraCyc, 164 MapMan. Preprocessing: filter out pathways with <10 genes that have expression data. Total: 230 pathways, 2 sets. Gene expression datasets: seedlings, tissues (leaves, roots, flowers, seeds), seed developmental stages, DS1. Preprocessing: filter on low variance and detection call, average replicates, normalize to controls, standardize experiments. Total: 216 GE profiles, 4 datasets, ~12500 genes.

10 MORPH Input: Arabidopsis thaliana. Metabolic (MD) network (AraCyc): nodes = metabolic genes (enzymes); edges = nodes share a metabolite (reactant or product). Preprocessing: remove most common metabolites (they connect enzymes with weak functional associations). Total: 1987 genes, interactions. PPI network (PAIR & Interactome Map databases): nodes = genes (proteins); edges = interactions between proteins. Preprocessing: unite (predicted & experimental) interactions from both databases. Total: 4642 genes, interactions.
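
A minimal sketch of how a metabolic network of this kind could be assembled, assuming a hypothetical mapping from enzyme genes to the metabolites of their reactions. The function name, the toy genes and the cutoff for "most common" metabolites are illustrative assumptions; the slides do not state which cutoff was used.

```python
from collections import Counter
from itertools import combinations

def build_md_network(gene_metabolites, n_common_to_drop=1):
    """Connect two enzyme genes when they share a metabolite.

    gene_metabolites: dict gene -> set of metabolites (hypothetical input).
    n_common_to_drop: number of most frequent metabolites to ignore
    (an assumed cutoff, not taken from the slides).
    """
    counts = Counter(m for mets in gene_metabolites.values() for m in mets)
    common = {m for m, _ in counts.most_common(n_common_to_drop)}
    edges = set()
    for g1, g2 in combinations(sorted(gene_metabolites), 2):
        shared = (gene_metabolites[g1] & gene_metabolites[g2]) - common
        if shared:  # an edge needs at least one informative shared metabolite
            edges.add((g1, g2))
    return edges

# Toy data: ATP is shared by every enzyme and carries little functional signal.
toy = {
    "geneA": {"ATP", "glucose"},
    "geneB": {"ATP", "glucose", "G6P"},
    "geneC": {"ATP", "pyruvate"},
}
edges = build_md_network(toy, n_common_to_drop=1)
edges_unfiltered = build_md_network(toy, n_common_to_drop=0)
```

Without the filter, the ubiquitous metabolite links every pair of enzymes; with it, only the pair sharing a specific metabolite stays connected.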

11 Talk outline 1. MORPH input types: (a) gene expression data, (b) pathways and (c) networks 2. Types of clustering (modules) methods 3. The MORPH algorithm and validation 4. Results 5. Comparison to other methods 6. Summary

12 MORPH Goal. Given a specific biological pathway, MORPH seeks candidate genes that participate in (or regulate) the pathway. A key step in MORPH is the partitioning of genes into modules (clusters). MORPH receives 3 types of input: 1. Pathways 2. Gene expression data 3. Partitioning into modules

13 Assumptions of clustering data into modules. Q: Why use modules? Modules reflect broad functions. Some functions are related to the target pathway. Pathway genes -> more coordinated expression than random genes.

14 Input: Partitioning Gene Modules and Networks. Different strategies for partitioning genes. Expression-based clustering: SOM = self-organizing map (partitions all genes); CLICK = CLuster Identification via Connectivity Kernels (partitions most genes). Network-based clustering: MATISSE*; Markov cluster algorithm (MCL). Annotation-based clustering: enzyme / not enzyme; orthologs in rice & maize / no orthologs.

15 Input: Partitioning Networks. Goal: construct modules using gene expression data and networks. Reminder: MATISSE seeks connected sub-networks with high expression similarity (Ulitsky & Shamir, 2007). Figure legend: interaction; high expression similarity. Problem: low coverage of MD network.

16 Input: Partitioning Networks - MATISSE*. Motivation: overcome low coverage of networks. MATISSE* (modified MATISSE): add genes with high correlation; repeat until module correlation <0.4; connectivity ignored. Results: MATISSE* increased MD network coverage to ~4500 genes; MATISSE* performed similarly to MATISSE.
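
The extension step of MATISSE* can be illustrated with a rough sketch. This is not the published implementation: it assumes a greedy loop that adds the unassigned gene with the highest mean Pearson correlation to the current module members and stops once that correlation drops below 0.4, ignoring network connectivity. The function names and toy profiles are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def extend_module(module, candidates, expr, min_corr=0.4):
    """Greedily add well-correlated genes to a module (assumed stopping rule)."""
    module = list(module)
    pool = set(candidates) - set(module)
    while pool:
        def score(g):
            # Mean correlation of gene g with the current module members.
            return mean(pearson(expr[g], expr[m]) for m in module)
        best = max(sorted(pool), key=score)
        if score(best) < min_corr:
            break  # no remaining gene is correlated enough with the module
        module.append(best)
        pool.remove(best)
    return module

expr = {
    "m1": [1, 2, 3, 4],
    "m2": [2, 4, 6, 9],
    "g1": [1, 2, 3, 5],   # follows the module profile
    "g2": [4, 3, 2, 1],   # anti-correlated with the module
}
extended = extend_module(["m1", "m2"], ["g1", "g2"], expr)
```

Here the gene that tracks the module is absorbed even though no network edge is consulted, which is how coverage can grow beyond the original network.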

17 Summary: Methods of Partitioning Gene Modules and Networks. Total of 8 clustering solutions.
Gene expression-based clustering:
  Clustering algorithm            Method
  SOM                             Co-expression
  CLICK                           Co-expression
Modules using network data:
  Clustering algorithm            Network
  Markov cluster process (MCL)    PPI
  MATISSE*                        PPI
  MATISSE*                        MD network
Annotation-based clustering:
  Bipartition                     Categories
  Enzymes                         Y/N
  Orthologs                       Y/N
No clustering: single module

18 Talk outline 1. MORPH input types: (a) gene expression data, (b) pathways and (c) networks 2. Types of clustering (modules) methods 3. The MORPH algorithm and validation 4. Results 5. Comparison to other methods 6. Summary

19 MORPH = MOdule guided Ranking of candidate PatHway genes. MORPH is an algorithm for prioritizing novel candidate genes for a given pathway. Input: 1. Pathway genes S = {s_1, s_2, …, s_l} 2. Gene expression profiles 3. Partition solution for genes with gene expression data: k modules M_1, …, M_k 4. Similarity function (D): Pearson/Spearman

20 Module-Guided Ranking Algorithm. Step #1: Partition genes into k modules M_1, M_2, …, M_k. Step #2: Identify pathway genes s_1, s_2, …, s_l and candidate genes g; ignore modules with no pathway genes; add a module for non-partitioned pathway genes. Step #3: Analyze each module separately.

21 Module-Guided Ranking Algorithm. Step #4: For each candidate gene g in module M_i, calculate its mean similarity with the pathway genes s_j in that module, using gene expression data and the similarity function (Pearson's correlation). This provides a ranking within the module. Figure labels: candidate genes; pre-defined module; pathway genes in module.
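
Step #4 can be sketched as follows, assuming expression profiles are stored in a plain dictionary from gene name to a list of values. The helper names and toy profiles are illustrative and not taken from the MORPH software.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def mean_similarity(candidate, pathway_genes, expr):
    """Step #4: average similarity of one candidate to the pathway genes
    that fall in the same module."""
    return mean(pearson(expr[candidate], expr[s]) for s in pathway_genes)

# One toy module with two pathway genes (s1, s2) and two candidates (g1, g2).
expr = {"s1": [1, 2, 3], "s2": [2, 4, 6], "g1": [3, 6, 9], "g2": [3, 2, 1]}
module_pathway_genes = ["s1", "s2"]
scores = {g: mean_similarity(g, module_pathway_genes, expr) for g in ["g1", "g2"]}
within_module_ranking = sorted(scores, key=scores.get, reverse=True)
```

Swapping `pearson` for a rank-based correlation would give the Spearman variant listed among the inputs.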

22 Module-Guided Ranking Algorithm. Step #5: Standardize the mean similarity scores within each module, using the mean and stdev of the mean similarity scores of all candidate genes in module M_i. Step #6: Rank all candidate genes (using standardized z-scores).
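
Steps #5 and #6 can be sketched as below, assuming the mean similarity scores from Step #4 are already available per module. The use of the population standard deviation and the function name `rank_candidates` are assumptions made for illustration.

```python
from statistics import mean, pstdev

def rank_candidates(module_scores):
    """module_scores: dict module -> dict candidate gene -> mean similarity.

    Returns (global ranking, z-scores). Each score is standardized against
    the other candidates of its own module before the global ranking.
    """
    z = {}
    for scores in module_scores.values():
        values = list(scores.values())
        mu, sd = mean(values), pstdev(values)  # pstdev is an assumed choice
        for gene, s in scores.items():
            z[gene] = (s - mu) / sd if sd else 0.0
    ranking = sorted(z, key=z.get, reverse=True)
    return ranking, z

module_scores = {
    "M1": {"a": 0.9, "b": 0.5, "c": 0.1},  # module with high raw similarities
    "M2": {"d": 0.30, "e": 0.20},          # module with low raw similarities
}
ranking, z = rank_candidates(module_scores)
```

In this toy case candidate `d` outranks `b` despite a lower raw similarity, because `d` stands out within its own module while `b` is average within its module.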

23 How do we assess predictions of many pathways? Given a clustering solution AND a gene expression dataset, run the algorithm for each pathway (Arabidopsis thaliana: 230 pathways). Pathways are assessed using a Leave-One-Out Cross-Validation (LOOCV) procedure.

24 Leave-One-Out Cross-Validation (LOOCV) procedure (Kharchenko et al., 2006). LOOCV generates a SELF-RANK for each pathway gene. Definition: the self-rank of a gene is its position in the ranking when it is left out of the algorithm's calculation. Meaning: the self-rank of a pathway gene reflects its overall strength of association with the remaining pathway genes.

25 Self-Rank Curve: AUSR score. LOOCV procedure, for each pathway S: 1. Remove one gene (v) -> S\{v} 2. Use S\{v} as the input pathway genes; v is the held-out test gene 3. Generate the ranking of v using S\{v} 4. Repeat for every v. Then: calculate the self-rank for all v in S; create the self-rank plot; calculate the area under the self-rank curve (AUSR) up to a self-rank threshold k. Figure 2: self-rank plot of the Carotenoid Biosynthetic Pathway (13 genes; SOM clustering solution) compared with a random gene set of 13 genes; x-axis: self-rank threshold (k). The AUSR score assesses pathway solutions (given input combinations, discussed next).
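
The LOOCV procedure and the AUSR score can be sketched as follows. This is an illustration under stated assumptions: `rank_fn` is a hypothetical stand-in for the full MORPH ranking, the self-rank curve is taken to be the fraction of pathway genes with self-rank at most k, and the area is normalized to lie between 0 and 1. The default `k_max=1000` follows the threshold quoted on the summary slide.

```python
def loocv_self_ranks(pathway, all_genes, rank_fn):
    """Leave each pathway gene out in turn and record where it is ranked."""
    self_ranks = []
    for v in pathway:
        rest = [s for s in pathway if s != v]              # S \ {v}
        candidates = [g for g in all_genes if g not in rest]
        ranking = rank_fn(rest, candidates)                # best candidate first
        self_ranks.append(ranking.index(v) + 1)            # 1-based self-rank
    return self_ranks

def ausr(self_ranks, k_max=1000):
    """Normalized area under the self-rank curve (assumed definition)."""
    n = len(self_ranks)
    curve = [sum(r <= k for r in self_ranks) / n for k in range(1, k_max + 1)]
    return sum(curve) / k_max

# Toy ranking: genes are scored by closeness to the centre of the pathway genes.
values = {"p1": 1.0, "p2": 1.1, "p3": 0.9, "x": 5.0, "y": 1.05}

def toy_rank_fn(pathway_genes, candidates):
    centre = sum(values[s] for s in pathway_genes) / len(pathway_genes)
    return sorted(candidates, key=lambda g: abs(values[g] - centre))

ranks = loocv_self_ranks(["p1", "p2", "p3"], list(values), toy_rank_fn)
score = ausr(ranks, k_max=2)
```

A pathway whose genes all recover rank 1 when left out gets an AUSR of 1, while genes ranked beyond `k_max` contribute nothing to the area.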

26 Talk outline 1. MORPH input types: (a) gene expression data, (b) pathways and (c) networks 2. Types of clustering (modules) methods 3. The MORPH algorithm and validation 4. Results 5. Comparison to other methods 6. Summary

27 Figure 3: Comparison of 2 gene expression datasets, AUSR (seedlings) - AUSR (DS1). Different: gene expression dataset. Same: MD network, MATISSE*, 66 AraCyc pathways. Different input produces different AUSR scores. This inspired the adoption of a selection procedure (learning configuration).

28 Learning Configuration. Definition: learning configuration = combination of a gene expression dataset (4) AND a clustering solution (8). Every pathway is tested with each gene expression dataset and partitioning solution (modules): a total of 4x8 = 32 combinations.

29 Machine Learning. MORPH applies a selection procedure: LOOCV is used to select the optimal learning configuration (i.e. dataset and clustering) for each examined pathway. LOOCV avoids overfitting, since the test gene is left out.
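
The selection procedure can be sketched as an exhaustive search over the 32 learning configurations. The dataset and clustering labels below are shorthand for the inputs named in the slides, and `evaluate_ausr` is a hypothetical callback standing in for a full LOOCV run of one pathway under one configuration.

```python
from itertools import product

DATASETS = ["seedlings", "tissues", "seeds", "DS1"]
CLUSTERINGS = [
    "SOM", "CLICK", "MCL (PPI)", "MATISSE* (PPI)", "MATISSE* (MD)",
    "enzymes", "orthologs", "no clustering",
]
CONFIGURATIONS = list(product(DATASETS, CLUSTERINGS))  # 4 x 8 = 32

def select_configuration(pathway, evaluate_ausr):
    """Return the configuration with the highest LOOCV AUSR for this pathway."""
    scores = {cfg: evaluate_ausr(pathway, *cfg) for cfg in CONFIGURATIONS}
    best = max(scores, key=scores.get)
    return best, scores[best]

def toy_evaluate(pathway, dataset, clustering):
    # Stand-in scores: one configuration is made artificially better.
    return 0.9 if (dataset, clustering) == ("seedlings", "MATISSE* (MD)") else 0.5

best_config, best_score = select_configuration(["geneA", "geneB"], toy_evaluate)
```

Because every AUSR comes from leave-one-out ranks, each pathway gene is scored without having contributed to its own ranking.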

30 Comparison of selection process to other fixed configurations. Figure 4: the average AUSR for each learning combination (gene expression dataset + clustering solution), 66 AraCyc metabolic pathways. Results: better: enzymes or MD network; poorer: PPI network, no clustering, SOM, CLICK & orthologs. (Metabolic genes had higher correlation.) Selection improved on all configurations.

31 Robustness of selection method: real vs. random pathways. Random pathways: randomly selected gene sets with the same size (repeated 100 times for each size). Results (66 AraCyc pathways): 29/66 real pathways had AUSR > maximal random score; AUSR > /66 - real pathways, 0 - random. Figure 5: AUSR scores of real and random pathways (axes: sizes, AUSR).
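
The real-versus-random comparison can be sketched as below, assuming a hypothetical `score_fn` that returns an AUSR-like score for any gene set. The 100 repeats per size follow the slide; the fixed seed, the function names and the toy scoring rule are illustrative assumptions.

```python
import random

def compare_with_random(real_score, size, all_genes, score_fn,
                        n_repeats=100, seed=0):
    """Check whether a real pathway beats every random gene set of equal size."""
    rng = random.Random(seed)
    random_scores = [
        score_fn(rng.sample(all_genes, size)) for _ in range(n_repeats)
    ]
    max_random = max(random_scores)
    return real_score > max_random, max_random

genes = [f"g{i}" for i in range(200)]
coherent = set(genes[:20])

def toy_score(gene_set):
    # Stand-in for an AUSR computation, capped at 0.5 for illustration.
    return 0.5 * len(coherent & set(gene_set)) / len(gene_set)

is_better, max_random = compare_with_random(0.9, 13, genes, toy_score)
```

Comparing against the maximum of the random scores, rather than their mean, is the stricter criterion used in the 29/66 result.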

32 Talk outline 1. MORPH input types: (a) gene expression data, (b) pathways and (c) networks 2. Types of clustering (modules) methods 3. The MORPH algorithm and validation 4. Results 5. Comparison to other methods 6. Summary

33 Comparison of MORPH to other methods: Arabidopsis thaliana pathways. Input: gene expression: seeds, tissues, seedlings, DS1; networks: PPI and MD networks; pathways: AraCyc, MapMan. Methods compared: coexpression methods (no network data) using reference datasets: ACT, DS1; Markov random field (MRF) methods (network data): CMRF = total # of pathway genes in neighbourhood, WMRF = total similarity with pathway genes in neighbourhood; k-nearest neighbour (k-NN) (network data). Figures 4B & 4C: 66 AraCyc pathways, 164 MapMan pathways.
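
The two neighbourhood scores defined for the MRF-based comparison can be sketched as follows. This covers only the per-gene scores as worded on the slide, not a full Markov random field inference; the function names, the adjacency dictionary and the similarity values are hypothetical.

```python
def cmrf_score(gene, neighbours, pathway):
    """CMRF: number of pathway genes among the gene's network neighbours."""
    return sum(1 for n in neighbours.get(gene, ()) if n in pathway)

def wmrf_score(gene, neighbours, pathway, similarity):
    """WMRF: total expression similarity with neighbouring pathway genes."""
    return sum(
        similarity(gene, n) for n in neighbours.get(gene, ()) if n in pathway
    )

neighbours = {"g": {"s1", "s2", "x"}}       # toy network adjacency
pathway = {"s1", "s2", "s3"}
toy_similarity = {("g", "s1"): 0.8, ("g", "s2"): 0.5, ("g", "x"): 0.9}

count = cmrf_score("g", neighbours, pathway)
weighted = wmrf_score(
    "g", neighbours, pathway, lambda a, b: toy_similarity[(a, b)]
)
isolated = cmrf_score("not_in_network", neighbours, pathway)
```

A gene absent from the network scores zero under both measures, which illustrates why network-only predictors cover fewer genes than MORPH.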

34 Figure 4D & 4E: Comparison to other methods. AraCyc pathways with AUSR>0.8; MapMan pathways with AUSR>0.7. The k-NN predictor complements MORPH.

35 My analysis: AUSR scores of MORPH and k-NN k-NN is twice as good as MORPH for high AUSRs >0.9 (6 compared to 3) Data retrieved from Supplemental Data Set 3

36 Carotenoid Pathway and the MORPH candidate genes. Carotenoids are antioxidants and perform stress response functions. Candidate genes (numbered octagons): 8/25 top candidates have predicted functions, with little detail on their roles in plants. Other predictions include genes with similar functions, e.g. response to oxidative stress. SQE3: catalyzes the precursor of a pathway whose expression is coordinated with the carotenoid pathway. SPS2: plastoquinone pathway, essential for the carotenoid pathway.

37 Comparison of MORPH to other methods: 93 tomato pathways. Figure 7: predictors include MORPH, k-NN, MRF-based, and coexpression-based classifiers. (A) Average and median AUSR scores. (B) The number of pathways that had an AUSR score above 0.7.

38 Talk outline 1. MORPH input types: (a) gene expression data, (b) pathways and (c) networks 2. Types of clustering (modules) methods 3. The MORPH algorithm and validation 4. Results 5. Comparison to other methods 6. Summary

39 Summary: Advantages of MORPH. 1. Robust across different pathways 2. k-NN considers only genes in the network; MORPH increases network coverage 3. k-NN is more dependent on sub-network diameter (higher diameter, lower AUSR); MORPH is more robust 4. The self-rank threshold k=1000 for AUSR ignores poor pathway gene correlations 5. Potentially useful predictions

40 Summary: Drawbacks of MORPH. 1. If pathway genes are not coherent, it is better to select the best/top module(s) than to average 2. Dependent on input quality (e.g. AraCyc > MapMan) 3. Predicts close pathways (drawback/advantage) 4. Requires known pathway information for predictions

41 Questions?

43 Top AUC scores for tested pathways
Columns: Pathway, Spearman AUC, Pearson AUC, Size
Pathways listed: photosynthesis light reactions; chlorophyllide biosynthesis I; carotenoids core pathway; tRNA charging pathway; gluconeogenesis; triacylglycerol degradation; cysteine biosynthesis I; fatty acid β-oxidation II (core pathway); glycolysis I; glycolysis IV (plant cytosol); Calvin-Benson-Bassham cycle; glucosinolate biosynthesis from homomethionine; homogalacturonan biosynthesis; glucosinolate biosynthesis from hexahomomethionine; glucosinolate biosynthesis from pentahomomethionine; ethylene biosynthesis from methionine

44 MORPH Classifications. 3 types of input data: pathway genes (s_1, s_2, …, s_l): 66 Arabidopsis thaliana pathways; gene expression: 4 datasets; partition of gene expression data into k modules M_1, …, M_k: 8 partitioning methods.
