Multi-Task Learning for HIV Therapy Screening Steffen Bickel, Jasmina Bogojeska, Thomas Lengauer, Tobias Scheffer.

Presentation transcript:

Multi-Task Learning for HIV Therapy Screening Steffen Bickel, Jasmina Bogojeska, Thomas Lengauer, Tobias Scheffer

HIV Therapy Screening. Usually, combinations of 3-6 drugs out of around 17 available antiretroviral drugs are administered. The effect of different combinations on the virus is similar but not identical. Only scarce training data is available from treatment records. Challenge: prediction of therapy outcome from genotypic information. [Figure: separate data sets for combination 1, combination 2, and combination 3, each containing successful and failed treatments.]

Multi-Task Learning. Several related prediction problems (tasks). The conditional p(y|x) of the label given the input is not necessarily identical across tasks, but some conditionals are usually similar. Challenge: use all available training data and account for the difference in distributions across tasks. HIV therapy screening can be modeled as a multi-task learning problem: drug combinations (tasks) have a similar but not identical effect on the virus.

Overview. Motivation: HIV therapy screening; multiple tasks with differing distributions. Multi-task learning by distribution matching: problem setting; density ratio matches pool to target distribution; discriminative estimation of matching weights. Case study: HIV therapy screening.

Multi-Task Learning – Problem Setting. A target distribution with labeled target data, and several auxiliary distributions with labeled auxiliary data. Goal: minimize the loss under the target distribution.
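A compact way to state this setting (reconstructed here because the slides' formula images did not survive extraction; the symbols z, t, and \ell are notational assumptions): each task z has its own joint distribution p(x, y | z) over inputs and labels, a labeled sample D_z is available for every task, and for a given target task t the goal is to find a model f that minimizes the expected loss E_{(x,y) \sim p(x,y|t)}[\ell(f(x), y)], using not only D_t but the data of all tasks.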

Multi-Task Learning. The target distribution and the pool distribution (all tasks' data pooled together) are in general not equal (≠), and the labeled target data alone is scarce. Goal: minimize the loss under the target distribution.

Distribution Matching. Goal: minimize the loss under the target distribution. The pool distribution is rescaled so that the weighted pool matches the target distribution (=): the expected loss under the target distribution can be written as an expectation over the training pool, with the loss of each pool example rescaled. [Figure: class-conditional densities over x for y = +1 and y = -1 under the target and pool distributions; individual examples x1 and x2 receive different rescaling weights.]
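The three labels on the slide (expected loss under target distribution, rescale loss for each pool example, expectation over training pool) refer to an identity of roughly this form (a reconstruction; p_{pool} denotes the mixture of all tasks' data):

E_{(x,y) \sim p(x,y|t)}[\ell(f(x),y)] = E_{(x,y) \sim p_{pool}(x,y)}[ \frac{p(x,y|t)}{p_{pool}(x,y)} \, \ell(f(x),y) ]

Minimizing a weighted loss over the pooled training data is therefore, in expectation, equivalent to minimizing the loss under the target distribution.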

Estimation of Density Ratio. Goal: minimize the loss under the target distribution. Theorem: the ratio of two potentially high-dimensional densities reduces to one binary conditional density. Intuition of the weight: how much more likely an example is to be drawn from the target density than from the auxiliary (pool) density. The conditional is estimated with a probabilistic classifier (e.g., logistic regression) that discriminates target examples from auxiliary-task examples in the pool. [Figure: pool containing target and auxiliary-task examples; examples lying towards the target region receive larger resampling weights.]
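The theorem referred to on the slide can be written as follows (a reconstruction consistent with the slide text; the paper's exact notation may differ). The ratio of the two potentially high-dimensional joint densities collapses to a single conditional:

\frac{p(x,y|t)}{p_{pool}(x,y)} = \frac{p(t|x,y)}{p(t)}

Here p(t|x,y) is the probability that an example (x, y) drawn from the pool belongs to the target task, and p(t) is the target task's share of the pool. Only p(t|x,y) needs to be learned, and a probabilistic classifier such as logistic regression that discriminates target examples from auxiliary-task examples provides exactly this estimate.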

Prior Knowledge on Task Similarity. Prior knowledge is expressed as a task similarity kernel and encoded in a Gaussian prior on the parameters v of the multi-class logistic regression model for the resampling weights. The main diagonal entries of the prior covariance are set to a standard regularization value; the diagonals of the sub-matrices coupling different tasks are set according to the task similarity kernel.
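One plausible reading of this block structure (a sketch only; the slide's actual matrix entries were images and are not recoverable here): with the parameter vector v stacked over tasks, the prior is v ~ N(0, \Sigma), where the main diagonal of \Sigma carries a standard regularization variance \sigma^2 and the diagonal of the sub-matrix coupling the parameters of tasks z and z' is set according to the task similarity kernel k(z, z').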

Distribution Matching – Algorithm. 1. Weight model: train a logistic regression that discriminates target from auxiliary data, with the task similarity encoded in the prior. 2. Target model: minimize the regularized empirical loss on the pool, with each example weighted by the output of the weight model from step 1.
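A minimal sketch of these two steps using scikit-learn, assuming a single target task and binary therapy-outcome labels; all variable and function names are illustrative, and plain L2 regularization stands in for the task-similarity prior of the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def distribution_matching(X_target, y_target, X_aux, y_aux,
                          C_weight=1.0, C_target=1.0):
    """Step 1: weight model (target vs. auxiliary); step 2: weighted target model."""
    # Pool all labeled data and remember which examples come from the target task.
    X_pool = np.vstack([X_target, X_aux])
    y_pool = np.concatenate([y_target, y_aux])
    is_target = np.concatenate([np.ones(len(X_target)), np.zeros(len(X_aux))])

    # The weight model conditions on (x, y), so append the label as an extra feature.
    Z_pool = np.hstack([X_pool, y_pool.reshape(-1, 1)])

    # Step 1: probabilistic classifier estimating p(target | x, y).
    weight_model = LogisticRegression(C=C_weight, max_iter=1000)
    weight_model.fit(Z_pool, is_target)
    p_target_given_xy = weight_model.predict_proba(Z_pool)[:, 1]
    p_target = is_target.mean()  # p(t): target task's share of the pool
    resampling_weights = p_target_given_xy / p_target  # r_t(x, y) = p(t|x,y) / p(t)

    # Step 2: target model trained on the whole pool with rescaled losses.
    target_model = LogisticRegression(C=C_target, max_iter=1000)
    target_model.fit(X_pool, y_pool, sample_weight=resampling_weights)
    return target_model
```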

Overview. Motivation: HIV therapy screening; multiple tasks with differing distributions. Multi-task learning by distribution matching: problem setting; density ratio matches pool to target distribution; discriminative estimation of matching weights. Case study: HIV therapy screening.

HIV Therapy Screening – Prediction Problem. Information about each patient x: a binary vector of resistance-relevant virus mutations and of previously given drugs. The drug combination is selected out of 17 drugs; drug combinations correspond to tasks z. Target label y is the success or failure of the therapy, with two different labelings (virus load and multi-conditional). [Figure: virus load over time, with the conditions that define therapy success.]

HIV Therapy Screening – Data. Patients from hospitals in Italy, Germany, and Sweden: 3260 labeled treatments and 545 different drug combinations (tasks); 50% of the combinations have only one labeled treatment. Similarity of drug combinations is encoded in a task kernel: a drug feature kernel (product of drug indicator vectors) and a mutation table kernel (similarity of the mutations that render a drug ineffective). 80/20 training/test split, consistent with time stamps. [Figure: timeline with training data preceding test data.]
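For the drug feature kernel, "product of drug indicator vectors" suggests an inner product between the binary vectors indicating which of the 17 drugs each combination contains; a tiny illustrative sketch (function and variable names are assumptions, not from the paper):

```python
import numpy as np

def drug_feature_kernel(combo_a, combo_b):
    """Inner product of binary drug indicator vectors:
    counts the drugs shared by two combinations (tasks)."""
    return float(np.dot(combo_a, combo_b))

# Hypothetical 17-drug panel: two combinations sharing two drugs.
combo_1 = np.zeros(17); combo_1[[0, 3, 7]] = 1        # drugs 0, 3, 7
combo_2 = np.zeros(17); combo_2[[0, 7, 11, 12]] = 1   # drugs 0, 7, 11, 12
print(drug_feature_kernel(combo_1, combo_2))  # -> 2.0
```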

Reference Methods. Independent models (separately trained). One-size-fits-all model with a product of task and feature kernel (Bonilla, Agakov, and Williams, 2007). Hierarchical Bayesian kernel (Evgeniou & Pontil, 2004). Hierarchical Bayesian Gaussian process (Yu, Tresp, and Schwaighofer, 2005). Logistic regression is the target model (except for the Gaussian process model); RBF kernels are used throughout.

Results – Distribution Matching vs. Other Methods. Distribution matching is always the best method (statistically significant in 17 of 20 cases) or as good as the best reference method. Improvement over separately trained models: 10-14%. [Figure: accuracy bars for separate models, one-size-fits-all, hierarchical Bayes kernel, hierarchical Bayes Gaussian process, and distribution matching, for the virus-load and multi-condition labelings.]

Results – Benefit of Prior Knowledge. The common prior knowledge on the similarity of drug combinations does not improve the accuracy of distribution matching. [Figure: accuracy bars for no prior knowledge, drug feature kernel, and mutation table kernel, for the virus-load and multi-condition labelings.]

Conclusions. Multi-task learning: multiple problems with different distributions. Distribution matching: the weighted pool distribution matches the target distribution; the weights are estimated discriminatively with logistic regression; the target model is trained with weighted loss terms. Case study, HIV therapy screening: distribution matching beats i.i.d. learning and hierarchical Bayes baselines, with a benefit of 10-14% over separately trained models.