Machine Learning – Course Overview
David Fenyő
Contact:
Learning
“A computer program is said to learn from experience E with respect to some task T and performance measure P if its performance at task T, as measured by P, improves with experience E.” Mitchell 1997, Machine Learning.
Learning: Task
Regression, classification, imputation, denoising, transcription, translation, anomaly detection, synthesis, probability density estimation.
Learning: Performance
Examples: regression – mean squared error; classification – cross-entropy.
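To make these two performance measures concrete, here is a minimal Python sketch (not part of the original slides; the arrays are hypothetical) computing the mean squared error for a regression task and the cross-entropy for a binary classification task:

```python
import numpy as np

# Regression: mean squared error between targets y and predictions y_hat
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.9, 3.4])
mse = np.mean((y - y_hat) ** 2)

# Binary classification: cross-entropy between true labels and the
# predicted probabilities of label 1
labels = np.array([1, 0, 1])
p = np.array([0.8, 0.2, 0.6])
cross_entropy = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

print(mse, cross_entropy)
```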
Learning: Experience
Unsupervised, supervised (regression and classification), and reinforcement learning.
Example: Image Classification
Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
Example: Games
Example: Language Translation
Example: Tumor Subtypes
Example: Pathology and Radiology
Schedule
1/27 Course Overview
1/31 Unsupervised Learning: Clustering
2/3 Unsupervised Learning: Dimension Reduction
2/7 Unsupervised Learning: Clustering and Dimension Reduction Lab
2/10 Unsupervised Learning: Trajectory Analysis
2/14 Supervised Learning: Regression
2/17 Supervised Learning: Regression Lab
2/21 Supervised Learning: Classification
2/24 Supervised Learning: Classification Lab
2/28 Student Project Plan Presentation
3/3 Supervised Learning: Performance Estimation
3/7 Supervised Learning: Regularization
3/10 Supervised Learning: Performance Estimation and Regularization Lab
3/24 Neural Networks
3/28 Neural Networks Lab
3/31 Tree-Based Methods
4/4 Support Vector Machines
4/11 Tree-Based Methods and Support Vector Machines Lab
4/14 Probabilistic Graphical Models
4/18 Machine Learning Applied to Text Data
4/21 Machine Learning Applied to Clinical Data
4/25 Machine Learning Applied to Omics Data
5/2 Student Project Presentation
5/5 Student Project Presentation
Probability: Bayes Rule
Multiplication rule: P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
Bayes rule: P(A|B) = P(B|A)P(A)/P(B)
For a hypothesis H and data D: P(H|D) = P(D|H) P(H) / P(D), where P(H|D) is the posterior probability, P(D|H) is the likelihood, and P(H) is the prior probability.
Bayes Rule: How to Choose the Prior Probability?
P(H|D) = P(D|H) P(H) / P(D): the posterior probability depends on the prior probability P(H).
If we have no knowledge, we can assume that each outcome is equally probable. For two mutually exclusive hypotheses H1 and H2: if we have no knowledge, P(H1) = P(H2) = 0.5; if we find out that hypothesis H2 is true, P(H1) = 0 and P(H2) = 1.
Bayes Rule: Normalization Factor
P(H|D) = P(D|H) P(H) / P(D), where P(D) acts as the normalization factor so that
P(Ω) = Σ_i P(H_i) = Σ_i P(H_i|D) = 1
Bayes Rule: More Data
The posterior after each data point serves as the prior for the next:
P(H|D1) = P(D1|H) P(H) / P(D1)
P(H|D1,D2) = P(D2|H) P(H|D1) / P(D2)
P(H|D1,D2,D3) = P(D3|H) P(H|D1,D2) / P(D3)
…
P(H|D1,…,Dn) = P(H) Π_{k=1}^{n} [P(Dk|H) / P(Dk)]
For example, with two mutually exclusive hypotheses H1 and H2 (priors: P(H1) = P(H2) = 0.5):
P(H2|D1) = P(D1|H2) P(H2) / P(D1) = 0.7 (P(H2) = 0.5, P(D1|H2)/P(D1) = 1.4)
P(H2|D1,D2) = P(D2|H2) P(H2|D1) / P(D2) = 0.88 (P(H2|D1) = 0.7, P(D2|H2)/P(D2) = 1.26)
P(H2|D1,D2,D3) = P(D3|H2) P(H2|D1,D2) / P(D3) ≈ 1 (P(H2|D1,D2) = 0.88, P(D3|H2)/P(D3) = 1.14)
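A minimal Python sketch of this sequential updating (the likelihood values below are hypothetical and chosen only to roughly reproduce the posteriors on the slide): each new data point Dk multiplies the current posterior by P(Dk|Hi), and renormalizing divides by P(Dk).

```python
import numpy as np

# Two mutually exclusive hypotheses H1 and H2 with uniform priors
posterior = np.array([0.5, 0.5])

# Hypothetical likelihoods [P(Dk|H1), P(Dk|H2)] for three observations
likelihoods = [
    np.array([0.3, 0.7]),    # D1 favours H2
    np.array([0.2, 0.7]),    # D2 favours H2 more strongly
    np.array([0.05, 0.9]),   # D3 favours H2 very strongly
]

for lik in likelihoods:
    unnormalized = lik * posterior                   # P(Dk|Hi) * P(Hi|D1..Dk-1)
    posterior = unnormalized / unnormalized.sum()    # dividing by the sum divides by P(Dk)
    print(posterior)                                 # P(H2|...) rises toward 1
```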
Bayes Rule and Information Theory
Entropy = −Σ_i p_i log₂(p_i)
For two mutually exclusive hypotheses H1 and H2:
If we have no knowledge, P(H1) = P(H2) = 0.5: Entropy = 1
If hypothesis H2 is true, P(H1) = 0 and P(H2) = 1: Entropy = 0
P(H1) = 0.3, P(H2) = 0.7: Entropy = 0.88
P(H1) = 0.11, P(H2) = 0.89: Entropy = 0.50
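A small Python check of these entropy values (a sketch, not from the slides):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; terms with p_i = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

for dist in ([0.5, 0.5], [0.0, 1.0], [0.3, 0.7], [0.11, 0.89]):
    print(dist, round(entropy(dist), 2))   # 1.0, 0.0, 0.88, 0.5
```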
Bayes Rule: Example: What is the bias of a coin?
Hypothesis: the probability of heads is θ (θ = 0.5 for an unbiased coin).
Data: 10 flips of a coin: 3 heads and 7 tails. Likelihood: P(D|θ) = θ³(1−θ)⁷.
Uninformative prior: P(θ) uniform. Posterior ∝ Likelihood × Prior.
With an informative prior instead, e.g. P(θ) ∝ θ²(1−θ)², the posterior is proportional to Likelihood × Prior = θ³(1−θ)⁷ · θ²(1−θ)² = θ⁵(1−θ)⁹.
With more data the posterior narrows: 10 flips (3 heads, 7 tails), 100 flips (45 heads, 55 tails), 1000 flips (515 heads, 485 tails), shown for both the uniform prior and the θ²(1−θ)² prior. As the number of flips grows, the posterior concentrates around the observed fraction of heads and the choice of prior matters less.
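The coin example can be reproduced numerically by evaluating the posterior on a grid of θ values; this is a minimal sketch (the grid resolution and the simple sum used for normalization are arbitrary choices, not from the slides):

```python
import numpy as np

theta = np.linspace(0, 1, 1001)            # grid of hypotheses for P(heads)

def posterior(heads, tails, prior):
    likelihood = theta**heads * (1 - theta)**tails   # P(D | theta)
    post = likelihood * prior
    return post / post.sum()                          # normalize over the grid

uniform_prior = np.ones_like(theta)
informative_prior = theta**2 * (1 - theta)**2         # the prior used on the slide

for heads, tails in [(3, 7), (45, 55), (515, 485)]:
    p_uni = posterior(heads, tails, uniform_prior)
    p_inf = posterior(heads, tails, informative_prior)
    # Report the most probable theta under each prior
    print(heads + tails, theta[np.argmax(p_uni)], theta[np.argmax(p_inf)])
```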
DREAM Challenges
Crowdsourcing
Crowdsourcing is a methodology that uses the voluntary help of large communities to solve problems posed by an organization. The term was coined in 2006, but the idea is not new: in 1714 the British Board of Longitude offered a prize to whoever could determine a ship's longitude at sea (winner: John Harrison, an unknown clock-maker).
Different types of crowdsourcing:
Citizen science: the crowd provides data (e.g., patients)
Labor-focused crowdsourcing: online workforce, tasks for money
Gamification: encode the problem as a game
Collaborative competitions (challenges)
Julio Saez-Rodriguez: RWTH-Aachen & EMBL-EBI
Collaborative competitions (challenges)
Post a question to the whole scientific community, withholding the answer (the 'gold standard').
Evaluate submissions against the gold standard with appropriate scoring.
Analyze the results.
[Diagram: design the open challenge with a train/test split, pose it to the community, and score the submissions.]
Julio Saez-Rodriguez: RWTH-Aachen & EMBL-EBI
Examples of DREAM challenges
Predict phosphoproteomic data and infer signalling networks upon perturbation with ligands and drugs (Prill et al., Science Signaling, 2011; Hill et al., Nature Methods, 2016)
Predict transcription factor binding sites (with ENCODE; ongoing)
Molecular classification of acute myeloid leukaemia from patient samples using flow cytometry data, with FlowCAP (Aghaeepour et al., Nature Methods, 2013)
Predict progression of amyotrophic lateral sclerosis patients from clinical trial data (Kuffner et al., Nature Biotechnology, 2015)
NCI-DREAM drug sensitivity prediction: predict the response of breast cancer cell lines to single (Costello et al., Nature Biotechnology, 2014) and combined (Bansal et al., Nature Biotechnology, 2014) drugs
The AstraZeneca-Sanger DREAM synergy prediction challenge: predict the effect of drug combinations on cancer cell lines from molecular data (just finished)
The NIEHS-NCATS-UNC DREAM Toxicogenetics challenge: predict the toxicity of chemical compounds (Eduati et al., Nature Biotechnology, 2015)
NCI-DREAM Drug sensitivity challenge
Costello et al., Nature Biotechnology, 2014
Some lessons from the drug sensitivity challenge
Some drugs are easier to predict than others, and this does not depend on the mode of action.
Gene expression and protein amount are the most predictive data types.
Integration of multiple data types and pathway information improves predictivity.
There is plenty of room for improvement.
The wisdom of the crowds: the aggregate of submissions is robust.
Costello et al., Nature Biotechnology, 2014
Value of collaborative competitions (challenges)
Challenge-based evaluation of methods is unbiased and enhances reproducibility.
Discover the best methods: determine the solvability of a scientific question, sample the space of methods, and understand the diversity of methodologies used to solve a problem.
Acceleration of research: the community of participants can do in 4 months what would take any single group 10 years.
Community building: make high-quality, well-annotated data accessible; foster community collaborations on fundamental research questions; determine robust solutions through community consensus: "The Wisdom of Crowds."
Julio Saez-Rodriguez: RWTH-Aachen & EMBL-EBI
Class Project
Pick one of the previous DREAM Challenges and analyze the data using several different methods.
2/28 Project Plan Presentation
5/2 Project Presentation
5/5 Project Presentation
Class Presentations
Pick one ongoing DREAM or biomedicine-related Kaggle challenge to present during one of the next classes.
Curse of Dimensionality
When the number of dimensions increases, the volume increases and the data becomes sparse. It is typical for biomedical data that there are few samples and many measurements.
Unsupervised Learning
Finding the structure in data: clustering and dimension reduction.
Unsupervised Learning: Clustering
How many clusters are there? Where should the borders between clusters be set? A distance measure needs to be selected. Examples of methods: k-means clustering, hierarchical clustering.
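As an illustration, a minimal numpy sketch of k-means (Lloyd's algorithm) on synthetic two-dimensional data; the data, the choice of k = 2, and the random initialization are hypothetical, and in practice a library implementation (e.g. scikit-learn) would normally be used:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two Gaussian clusters in 2-D
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

def kmeans(X, k, n_iter=100):
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        # Assign each point to its nearest center (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

centers, labels = kmeans(X, k=2)
print(centers)   # approximately the two cluster means
```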
Unsupervised Learning: Dimension Reduction
Examples of methods: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), Independent Component Analysis (ICA), Non-Negative Matrix Factorization (NMF), Multi-Dimensional Scaling (MDS).
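A minimal sketch of one of these methods, PCA, via the singular value decomposition of the centered data matrix (the data here are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))       # hypothetical data: 100 samples, 5 measurements

Xc = X - X.mean(axis=0)             # center each measurement
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                 # directions of largest variance
X_reduced = Xc @ components.T       # project the samples onto the first two components
explained_variance = S**2 / (len(X) - 1)

print(X_reduced.shape, explained_variance[:2])
```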
Supervised Learning: Regression
Choose a function f(x, w) and a performance metric Σ_j g(y_j − f(x_j, w)) to minimize, where (y_j, x_j) is the training data and w = (w_1, w_2, …, w_k) are the k parameters. Commonly, f is a linear function of w, f(x, w) = Σ_i w_i f_i(x), and g is the squared error, so the weights are found by setting the derivatives of the sum of squared errors to zero:
∂/∂w_i Σ_j (y_j − Σ_i w_i f_i(x_j))² = 0
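A minimal Python sketch of this least-squares fit (the training data are synthetic, and the basis functions, a constant and x, are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.1, 50)    # hypothetical training data (y_j, x_j)

# Basis functions f_i(x): a constant term and x itself
F = np.column_stack([np.ones_like(x), x])

# Setting the derivatives of the sum of squared errors to zero gives the normal
# equations (F^T F) w = F^T y; lstsq solves them in a numerically stable way.
w, *_ = np.linalg.lstsq(F, y, rcond=None)
print(w)    # approximately [2, 3]
```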
Model Capacity: Overfitting and Underfitting
[Figure: error on the training set as a function of the degree of the polynomial.]
"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." (John von Neumann)
Training and Testing
The data set is split into a training set and a test set.
[Figure: training and testing error as a function of the degree of the polynomial; the training error keeps decreasing with model capacity, while the testing error eventually increases (overfitting).]
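The behaviour in the figure can be reproduced with a small experiment (a sketch with hypothetical data): fit polynomials of increasing degree to a noisy training set and evaluate the error on a separate test set. The training error keeps shrinking, while the test error typically grows once the model starts fitting the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: np.sin(3 * x)
x_train = rng.uniform(-1, 1, 15)
x_test = rng.uniform(-1, 1, 100)
y_train = f_true(x_train) + rng.normal(0, 0.2, x_train.size)
y_test = f_true(x_test) + rng.normal(0, 0.2, x_test.size)

for degree in range(1, 11):
    coeffs = np.polyfit(x_train, y_train, degree)    # fit on the training set only
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 3), round(test_err, 3))
```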
Regularization
Linear regression: ∂/∂w_i Σ_j (y_j − Σ_i w_i f_i(x_j))² = 0
Regularized (L2) linear regression: ∂/∂w_i [Σ_j (y_j − Σ_i w_i f_i(x_j))² + λ Σ_i w_i²] = 0
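A minimal sketch of the effect of the L2 penalty (ridge regression) using a polynomial basis; the data, basis size, and λ value are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 20)
y = np.sin(3 * x) + rng.normal(0, 0.2, 20)

# Polynomial basis functions f_i(x) = x^i, i = 0..9
F = np.vander(x, 10, increasing=True)

def fit(F, y, lam):
    # Setting the regularized derivatives to zero gives (F^T F + lambda*I) w = F^T y
    return np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ y)

w_plain = fit(F, y, lam=0.0)
w_ridge = fit(F, y, lam=0.1)
print(np.round(w_plain, 2))
print(np.round(w_ridge, 2))   # the penalty shrinks the weights toward zero
```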
Supervised Learning: Classification
Evaluation of Binary Classification Models
Confusion matrix:
            Predicted 0       Predicted 1
Actual 0    True Negative     False Positive
Actual 1    False Negative    True Positive

Accuracy = (TP+TN)/total – fraction of correct predictions
Precision = TP/(TP+FP) – fraction of correct among positive predictions
Sensitivity = TP/(TP+FN) – fraction of correct predictions among label 1; also called true positive rate and recall
Specificity = TN/(TN+FP) – fraction of correct predictions among label 0
False Positive Rate = FP/(FP+TN) – fraction of label 0 predicted to be label 1
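These quantities can be computed directly from predicted and actual labels; a small sketch with hypothetical labels:

```python
import numpy as np

# Hypothetical actual and predicted labels (0 or 1)
actual    = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
predicted = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 0])

tp = np.sum((actual == 1) & (predicted == 1))   # true positives
tn = np.sum((actual == 0) & (predicted == 0))   # true negatives
fp = np.sum((actual == 0) & (predicted == 1))   # false positives
fn = np.sum((actual == 1) & (predicted == 0))   # false negatives

accuracy    = (tp + tn) / len(actual)
precision   = tp / (tp + fp)
sensitivity = tp / (tp + fn)          # true positive rate, recall
specificity = tn / (tn + fp)
false_positive_rate = fp / (fp + tn)  # 1 - specificity

print(accuracy, precision, sensitivity, specificity, false_positive_rate)
```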
Evaluation of Binary Classification Models
Receiver Operating Characteristic (ROC)
[Figure: ROC curves (sensitivity versus 1 − specificity, obtained by varying the score threshold) for Algorithm 1 and Algorithm 2.]
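An ROC curve can be traced by sweeping a threshold over the classifier scores; a minimal sketch with simulated scores (the score distributions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated classifier scores: label-1 samples tend to score higher
labels = np.concatenate([np.zeros(100), np.ones(100)])
scores = np.concatenate([rng.normal(0, 1, 100), rng.normal(1.5, 1, 100)])

# Sweep the score threshold from high to low and record the operating points
thresholds = np.sort(scores)[::-1]
tpr = np.array([np.mean(scores[labels == 1] >= t) for t in thresholds])  # sensitivity
fpr = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])  # 1 - specificity

# Area under the ROC curve by the trapezoidal rule
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(round(auc, 3))
```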
Training: Gradient Descent
We want to use a large learning rate when we are far from the minimum and decrease it as we get closer.
If the gradient is small in an extended region, gradient descent becomes very slow.
Gradient descent can get stuck in local minima. To improve the behavior for shallow local minima, we can modify gradient descent to take the average of the gradient over the last few steps (similar to momentum and friction).
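A sketch of gradient descent with such a velocity (momentum) term on a hypothetical loss surface; the surface, learning rate, and momentum value are illustrative only:

```python
import numpy as np

# Hypothetical loss surface: a long, narrow valley (ill-conditioned quadratic)
def grad(w):
    return np.array([2 * w[0], 50 * w[1]])

def gradient_descent(w0, learning_rate, momentum, n_steps):
    w = np.array(w0, dtype=float)
    velocity = np.zeros_like(w)
    for _ in range(n_steps):
        # The velocity term accumulates a decaying average of recent gradients;
        # with momentum = 0 this reduces to plain gradient descent.
        velocity = momentum * velocity - learning_rate * grad(w)
        w = w + velocity
    return w

print(gradient_descent([3.0, 1.0], learning_rate=0.02, momentum=0.0, n_steps=200))
print(gradient_descent([3.0, 1.0], learning_rate=0.02, momentum=0.9, n_steps=200))
```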
Validation: Choosing Hyperparameters
The data set is split into a training set, a validation set, and a test set. Examples of hyperparameters: learning rate, regularization parameter.
Cross-Validation
The training data is divided into folds; each fold in turn is held out for validation (Training 1/Validation 1, Training 2/Validation 2, Training 3/Validation 3, Training 4/Validation 4) while the model is trained on the remaining folds, and the test set is kept separate throughout.
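A minimal sketch of k-fold cross-validation used to choose a hyperparameter; here the hyperparameter is the degree of a polynomial fit, and the data, k = 4, and the degrees tried are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = np.sin(3 * x) + rng.normal(0, 0.2, 40)   # hypothetical training data

def cross_validation_error(x, y, degree, k=4):
    """Average validation error of a polynomial fit over k folds."""
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(x)), fold)
        coeffs = np.polyfit(x[train], y[train], degree)   # fit on the other folds
        pred = np.polyval(coeffs, x[fold])                 # evaluate on the held-out fold
        errors.append(np.mean((pred - y[fold]) ** 2))
    return np.mean(errors)

# The degree with the lowest cross-validation error would be selected
for degree in range(1, 8):
    print(degree, round(cross_validation_error(x, y, degree), 3))
```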
Preparing Data
Cleaning the data, handling missing data, transforming data.
Missing Data
Missing completely at random, missing at random, missing not at random.
Handling options: discard samples or measurements containing missing values, or impute the missing values.
Sampling Bias
DF Ransohoff, "Bias as a threat to the validity of cancer molecular-marker research", Nat Rev Cancer 5 (2005)
Data Snooping
Do not use the test data for any purpose during training.
No Free Lunch
Averaged over all possible data-generating distributions, no learning algorithm outperforms any other; good performance on a particular problem requires assumptions that match that problem.
Wolpert, David (1996), Neural Computation.
Can we trust the predictions of classifiers?
Ribeiro, Singh and Guestrin, "Why Should I Trust You? Explaining the Predictions of Any Classifier", in: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016.
Adversarial Fooling Examples
[Figure panels: an original, correctly classified image; the added perturbation; the perturbed image, classified as an ostrich.]
Szegedy et al., "Intriguing properties of neural networks".
Home Work
Read Saez-Rodriguez et al., Crowdsourcing biomedical research: leveraging communities as innovation engines. Nat Rev Genet. Jul 15;17(8).
Pick one of the previous DREAM Challenges and analyze the data using several different methods.
Pick one ongoing DREAM or biomedicine-related Kaggle challenge to present during one of the next classes.