Takeda Pharmaceutical Inc.

Presentation transcript:

Takeda Pharmaceutical Inc. Integrating in Vitro Drug Sensitivity and Genomics Data for Identification of Novel Drug Pathway Associations Cong Li and Ray Liu Yale University and Takeda Pharmaceutical Inc. May 19, 2015 Presented at MBSW Muncie, IN, USA

Introduction. Interests: indication selection; patient selection. Experiments: cell line drug response assay; microarray assay. Data: IC50 and microarray gene expression. Current analysis practice: stepwise and test-based. Our goal: develop a method that analyzes the available data jointly and incorporates biological information. [Diagram labels: gene, change in gene expression, IC50, drug, cell line.]

Microarrays

Questions Often Asked Design issues Which genes are differentially expressed between the conditions? Which genes can be used to classify/predict? How? Can biological networks be inferred from these data? What are the biological stories in the data?

Drug Pathway Questions. The current drug development framework typically considers the effect of a compound on a single target. Pathway-based approaches for drug discovery consider the therapeutic effects of compounds in the global physiological environment. For many compounds, the target pathways and mechanisms of action are still unknown. How can the target pathways of drugs be inferred?

Motivating Data Sets (http://www.broadinstitute.org/ccle/). Gene expression data: Affymetrix U133+2 arrays, mapped to ~19,000 genes across over 1000 cancer cell lines; among them, 480 cell lines have available drug response data. Use genes included in two lists: (1) 766 cancer-related genes (Chen, et al., 2008); (2) 8919 genes from the Integrated Druggable Genome Database (IDGD) Project (Hopkins and Groom, 2002; Russ and Lampel, 2005). Pathway association information: retrieved from the KEGG MEDICUS database (Kanehisa, et al., 2010); 58 pathways which are either known to be related to cancer or have drug targets. Among the genes selected in step (1), 1863 genes are covered by these 58 pathways and constitute the final list of genes in our real data analysis. Drug response data: 24 drugs annotated in the CancerResource database (Ahmed, et al., 2011), with response measured as log(Activity Area); 22 drugs have known targets covered by the 58 pathways.

Overview of the 22 drugs

Activity Area (shaded area) Activity area is a combined measure of both drug potency and drug efficacy, whereas GI50 only measures drug potency.
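A minimal sketch of why Activity Area reflects efficacy as well as potency. It assumes, which the slide does not state, that the response is recorded as fractional growth inhibition at a fixed set of increasing doses and that Activity Area can be approximated by summing the inhibition over those doses. The function names and both toy curves are hypothetical.

```python
def activity_area(inhibition):
    """Approximate Activity Area as the summed fractional inhibition
    over the tested doses (an assumed, simplified definition)."""
    return sum(inhibition)


def gi50_index(inhibition):
    """Index of the first tested dose reaching 50% inhibition, or None."""
    for i, value in enumerate(inhibition):
        if value >= 0.5:
            return i
    return None


# Two toy dose-response curves over the same eight doses.
# Both cross 50% inhibition at the same dose (same potency),
# but only the second approaches complete inhibition (higher efficacy).
weak = [0.0, 0.1, 0.3, 0.5, 0.55, 0.6, 0.6, 0.6]
strong = [0.0, 0.1, 0.3, 0.5, 0.8, 0.95, 1.0, 1.0]

same_potency = gi50_index(weak) == gi50_index(strong)
larger_area = activity_area(strong) > activity_area(weak)
```

In this toy case a GI50-style measure cannot separate the two compounds, while the area-based measure can.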

Data Format. Two matrices with cell lines as rows (Cell line 1, Cell line 2, Cell line 3, ...): drug sensitivity values (e.g. Activity Area or GI50), with columns drug1, drug2, drug3, ...; and basal gene expression levels (before drug treatment), with columns gene1, gene2, gene3, gene4, ....

Model Description. A spike-and-slab mixture prior (West, 2003) is placed on the factor loading matrices W1 and W2 to impose sparsity and to utilize prior knowledge on gene-pathway and drug-pathway associations (matrices L1 and L2).

Instead of adopting a full Bayesian treatment, we use the following integrative Penalized Matrix Decomposition (iPaD) framework [equation not captured in the transcript]. Note the notation differences from iFad: Y(1) is the drug response profile matrix; Y(2) is the gene expression profile matrix; X is the pathway activity level matrix; B(1) and B(2) are the pathway loading matrices for drug responses and gene expressions, respectively. The indexes of the non-zero elements in B(2) are known and denoted by Γ. The major interest is to find the non-zero elements in B(1).
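The objective itself does not survive in the transcript. One form consistent with the notation above and with the sub-problems described on the algorithm slides (a LASSO problem for each column of B(1), least squares for B(2) on the known support Γ, projected gradient descent for X) might look like the following. The equal weighting of the two loss terms and the unit-norm constraint on the columns of X are assumptions, not taken from the slides.

```latex
\min_{X,\,B^{(1)},\,B^{(2)}}\;
\bigl\| Y^{(1)} - X B^{(1)} \bigr\|_F^2
+ \bigl\| Y^{(2)} - X B^{(2)} \bigr\|_F^2
+ \lambda \bigl\| B^{(1)} \bigr\|_1
\quad \text{s.t.} \quad
B^{(2)}_{kj} = 0 \ \text{for } (k,j) \notin \Gamma,
\qquad
\bigl\| X_{\cdot k} \bigr\|_2 \le 1 \ \text{for each pathway } k .
```

Here λ is the sparsity parameter for B(1), and some constraint on X of this kind is needed because otherwise X could be scaled up and B(1) scaled down to shrink the penalty without changing the fit.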

The algorithm. The optimization problem in iPaD is a bi-convex problem, which motivates the following block-wise optimization strategy. Step 1: optimize over B(1) and B(2) while keeping X fixed. Step 2: optimize over X while keeping B(1) and B(2) fixed. Step 3: iterate between Steps 1 and 2 until convergence.

The algorithm (cont.). When X is fixed, optimizing each column of B(1) is a LASSO problem. When X is fixed, optimizing each column of B(2) is an ordinary least squares (OLS) problem. When B(1) and B(2) are fixed, X can be optimized using an iterative projected gradient descent algorithm.
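The block-wise strategy can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the LASSO sub-problem is solved here with a simple proximal gradient (ISTA) loop, the constraint on X is assumed to be a unit L2 ball per column, a fixed number of outer iterations stands in for a convergence check, and all function names (`ipad_sketch`, `lasso_column`, and so on) are hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_column(X, y, lam, b, n_iter=200):
    """ISTA for 0.5 * ||y - X b||^2 + lam * ||b||_1 (assumed solver)."""
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        b = soft_threshold(b - step * X.T @ (X @ b - y), step * lam)
    return b


def ols_on_support(X, y, support):
    """Least squares restricted to the known non-zero entries (Gamma)."""
    b = np.zeros(X.shape[1])
    if support.any():
        b[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return b


def project_columns(X):
    """Project each column of X onto the unit L2 ball (assumed constraint)."""
    return X / np.maximum(np.linalg.norm(X, axis=0), 1.0)


def update_X(X, Y, B, n_iter=50):
    """Projected gradient descent on 0.5 * ||Y - X B||_F^2."""
    step = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        X = project_columns(X - step * (X @ B - Y) @ B.T)
    return X


def ipad_sketch(Y1, Y2, gamma, lam, n_outer=20, seed=0):
    """Alternate between (B1, B2) given X and X given (B1, B2).

    Y1: cell lines x drugs, Y2: cell lines x genes,
    gamma: boolean pathways x genes matrix marking the known support of B2.
    """
    rng = np.random.default_rng(seed)
    n, K = Y1.shape[0], gamma.shape[0]
    X = project_columns(rng.standard_normal((n, K)))
    B1 = np.zeros((K, Y1.shape[1]))
    B2 = np.zeros((K, Y2.shape[1]))
    for _ in range(n_outer):
        # Step 1: column-wise updates with X fixed.
        for j in range(Y1.shape[1]):
            B1[:, j] = lasso_column(X, Y1[:, j], lam, B1[:, j])
        for j in range(Y2.shape[1]):
            B2[:, j] = ols_on_support(X, Y2[:, j], gamma[:, j])
        # Step 2: update X with B1 and B2 fixed.
        X = update_X(X, np.hstack([Y1, Y2]), np.hstack([B1, B2]))
    return X, B1, B2
```

Stacking Y(1) and Y(2) side by side in the X update reflects that both data types share the same pathway activity matrix X.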

Dealing with missing values. A gene/drug or cell line that is completely missing can be excluded. However, partially missing genes/drugs or cell lines should be kept in the analysis. In our block-wise algorithm, B(1) and B(2) can be optimized column by column with the missing values excluded. However, optimizing X is less straightforward because neither its rows nor its columns can be optimized separately.

We use the following soft-impute algorithm to optimize X in the presence of missing values [update equation not captured in the transcript]. Ω indexes the observed elements in a matrix, and PΩ(·) is an operator that projects a matrix onto the space of its observed elements.
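The exact update is not preserved in the transcript. A minimal sketch of the general soft-impute idea is given below, assuming missing entries are marked as NaN: observed entries are kept, missing entries are filled with the current fitted values X B, and the completed matrix is then used in the usual update of X. The function names are hypothetical.

```python
import numpy as np


def project_observed(M, observed):
    """P_Omega: keep the observed entries of M and set the others to zero."""
    return np.where(observed, M, 0.0)


def soft_impute_fill(Y, observed, X, B):
    """Completed matrix: observed entries from Y, missing ones from X @ B."""
    return np.where(observed, Y, X @ B)


# Toy example: 3 cell lines, 2 pathways, 2 responses, one missing value.
Y = np.array([[1.0, np.nan],
              [0.5, 2.0],
              [3.0, 1.0]])
observed = ~np.isnan(Y)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
B = np.array([[1.0, 2.0],
              [0.5, 1.0]])
Y_filled = soft_impute_fill(Y, observed, X, B)
```

Alternating this fill-in step with a gradient update of X on the completed matrix lets X be optimized even though no row or column is fully observed.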

Parameter tuning. There is a parameter λ that controls the sparsity of B(1). One way to use the method is to apply a decreasing sequence of λ's to obtain a sequence of solutions for B(1). We can also perform cross-validation on the drug response profile matrix Y(1) (in the figure, green: training data; black: testing data). Significance test. After finding an appropriate λ value, we can perform permutation tests to establish the significance of the identified drug-pathway associations: permute the cell lines (rows) in Y(1) while keeping Y(2) unchanged.
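The permutation scheme can be sketched as below. The sketch uses plain empirical p-values and does not reproduce the normal-plus-point-mass approximation of the null mentioned later in the talk. `fit_stat` is a hypothetical callable standing in for "fit the model at the chosen λ and return one statistic per drug-pathway pair"; the toy statistic in the usage lines is only a placeholder.

```python
import numpy as np


def permutation_pvalues(Y1, Y2, fit_stat, n_perm=200, seed=0):
    """Empirical p-values from permuting the rows (cell lines) of Y1 only.

    fit_stat(Y1, Y2) must return an array of association statistics;
    larger absolute values mean stronger association.
    """
    rng = np.random.default_rng(seed)
    observed = np.abs(fit_stat(Y1, Y2))
    exceed = np.zeros_like(observed, dtype=float)
    for _ in range(n_perm):
        perm = rng.permutation(Y1.shape[0])
        # Shuffling rows of Y1 breaks the link to Y2 but keeps each
        # drug's marginal response distribution intact.
        exceed += np.abs(fit_stat(Y1[perm], Y2)) >= observed
    # Add-one correction so that no p-value is exactly zero.
    return (exceed + 1.0) / (n_perm + 1.0)


# Toy usage with a placeholder statistic (cross-product with one gene).
rng = np.random.default_rng(42)
Y2_toy = rng.standard_normal((40, 3))
Y1_toy = rng.standard_normal((40, 2))
Y1_toy[:, 0] = 5.0 * Y2_toy[:, 0] + 0.1 * rng.standard_normal(40)
pvals = permutation_pvalues(Y1_toy, Y2_toy, lambda a, b: a.T @ b[:, :1])
```

Keeping Y(2) fixed preserves the gene expression structure, so only the drug-to-pathway link is tested.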

Simulations. We performed the following four sets of simulations (the 58 pathways in the real data were used; the number of drugs d = 22). [Table of simulation settings, flattened in the transcript. Columns: N, η, SNR1, SNR2. Rows and the values that survive: Sample Size (120, 0.1, 0.5, 240, 360, 480); Sparsity of B(1) (0.02, 0.05, 0.2); Signal-to-Noise Ratio (0.25, 1); Unbalanced Signal-to-Noise Ratio.] The simulated data sets were analyzed by both iFad and iPaD. Their performances were evaluated by the area under the ROC curve (AUC).
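For reference, the AUC used for evaluation can be computed directly from its rank interpretation. The sketch below assumes each drug-pathway pair gets a score (for example the absolute value of the estimated B(1) entry) and a label saying whether the corresponding entry of the simulated B(1) is truly non-zero; that pairing is an assumption about how the evaluation was set up, and the function name is hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve as the probability that a randomly chosen
    true association scores higher than a randomly chosen null pair
    (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy scores for six drug-pathway pairs; 1 marks a true association.
toy_scores = [0.9, 0.7, 0.4, 0.3, 0.2, 0.0]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of true associations above null pairs.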

The performances of the two methods are similar

The performances of the two methods are similar (cont.). However, iPaD is much faster: 1,000 iterations in iFad take 4~5 days, whereas solving a sequence of λ's takes only ~6 minutes.

Real Data Analysis. We analyzed the CCLE data set described earlier with both iFad and iPaD. iFad: 2,000 MCMC iterations; iPaD: 10-fold CV followed by 2,000 permutations (the null distribution was approximated using a mixture of a normal distribution and a point mass at zero). We call a drug-pathway association validated if the pathway contains at least one protein targeted by the drug. Among the 58 x 22 = 1276 drug-pathway pairs, 195 pairs are validated associations (195/1276 = 15.3%). Considering the randomness in the algorithms, we ran five repeats. Among the top 50 drug-pathway association pairs identified by iFad, 7.0 pairs (averaged over five repeats) were validated; for iPaD, 16.6 pairs. The top associations identified by iPaD were relatively consistent over the five repeats, but those identified by iFad were not (iFad probably did not converge). Running time: 2,000 MCMC iterations cost ~230 hours on a standard laptop computer (2.4 GHz dual-core CPU with 8 GB memory, running Mac OS X 10.9); 2,000 permutations cost ~6 hours for iPaD.
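To put the validated counts in context, a short calculation shows what random selection would give. With 195 validated pairs out of 1276, a random top-50 list would contain about 7.6 validated pairs on average, which is close to the 7.0 reported for iFad. The hypergeometric tail below is an illustrative check only; rounding the iPaD average of 16.6 up to 17 is an assumption made for the example, and the function name is hypothetical.

```python
from math import comb


def hypergeom_tail(k, n_total, n_valid, n_draw):
    """P(X >= k) when n_draw pairs are drawn without replacement from
    n_total pairs, n_valid of which are validated."""
    total = comb(n_total, n_draw)
    upper = min(n_valid, n_draw)
    favourable = sum(
        comb(n_valid, i) * comb(n_total - n_valid, n_draw - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


N_PAIRS = 58 * 22   # all drug-pathway pairs
N_VALID = 195       # validated pairs
TOP = 50            # size of the top list

expected_by_chance = TOP * N_VALID / N_PAIRS
p_at_least_7 = hypergeom_tail(7, N_PAIRS, N_VALID, TOP)
p_at_least_17 = hypergeom_tail(17, N_PAIRS, N_VALID, TOP)
```

Comparing `p_at_least_7` with `p_at_least_17` gives a rough sense of how much further above the chance level the iPaD list sits than the iFad list.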

The Chronic Myeloid Leukemia Pathway

The ErbB Signaling Pathway

Limitations/Future Work. Relatively simple additive models. Limited and unreliable information on pathways. Pathway network topology not considered. Other sources of information. Tradeoff between model simplicity, computational feasibility, and real biological complexity.

Thank you!