
1 Machine Learning

2 Example: Image Classification
Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge", IJCV, 2015.

3 Example: Games

4 Example: Language Translation

5 Example: Tumor Subtypes

6 Example: Skin Cancer Diagnosis
Esteva et al., "Dermatologist-level classification of skin cancer with deep neural networks", Nature, 2017.

7 Unsupervised Learning
Finding the structure in data: clustering, dimension reduction.

8 Unsupervised Learning: Clustering
How many clusters? Where to set the borders between clusters? We need to select a distance measure. Examples of methods: k-means clustering, hierarchical clustering (a k-means sketch follows below).
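A minimal k-means sketch (Lloyd's algorithm) in NumPy; the synthetic data, number of clusters, and iteration count are illustrative assumptions, not from the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centers at k randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        # (assumes no cluster ends up empty)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# three illustrative 2-D blobs
X = np.vstack([np.random.randn(50, 2) + offset for offset in ([0, 0], [5, 5], [0, 5])])
labels, centers = kmeans(X, k=3)
```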

9 Unsupervised Learning: Dimension Reduction
Examples of methods: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), Independent Component Analysis (ICA), Non-Negative Matrix Factorization (NMF), Multi-Dimensional Scaling (MDS). A minimal PCA sketch follows below.
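A minimal PCA sketch via the SVD in NumPy; the matrix shapes and component count are illustrative assumptions.

```python
import numpy as np

def pca(X, n_components=2):
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # project onto top principal components

X = np.random.randn(200, 10)
X2 = pca(X, n_components=2)                      # 200 x 2 low-dimensional embedding
```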

10 Linear Regression – one independent variable
Relationship: $y = w_1 x_1 + w_0 + \epsilon$
Data: $(y_j, x_{1j})$ for $j = 1, \dots, n$
Loss function (sum of squared errors): $L = \sum_j \epsilon_j^2 = \sum_j \left(y_j - (w_1 x_{1j} + w_0)\right)^2$
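To make the loss concrete, here is a minimal NumPy sketch of the sum-of-squared-errors loss for one independent variable; the example data and weights are illustrative assumptions.

```python
import numpy as np

def sse_loss(w1, w0, x1, y):
    residuals = y - (w1 * x1 + w0)      # epsilon_j for each sample
    return np.sum(residuals ** 2)       # L = sum_j epsilon_j^2

x1 = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])      # roughly y = 2*x + 1 with noise
print(sse_loss(2.0, 1.0, x1, y))
```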

11 Linear Regression – Error Landscape
[Plot: sum of squared errors vs. slope]

12 Linear Regression – Error Landscape
[Plot: sum of squared errors as a function of slope and intercept]

13 Linear Regression – Error Landscape
[Plot: sum of squared errors as a function of slope and intercept]

14 Linear Regression – One Independent Variable
Minimizing the loss function:
$\frac{\partial L}{\partial w_1} = \frac{\partial}{\partial w_1} \sum_j \epsilon_j^2 = 0 \qquad \frac{\partial L}{\partial w_0} = \frac{\partial}{\partial w_0} \sum_j \epsilon_j^2 = 0$

15 Linear Regression – One Independent Variable
Minimizing the loss function, L (sum of squared errors):
$\frac{\partial L}{\partial w_1} = \frac{\partial}{\partial w_1} \sum_j \epsilon_j^2 = \frac{\partial}{\partial w_1} \sum_j \left(y_j - (w_1 x_{1j} + w_0)\right)^2 = 0$
$\frac{\partial L}{\partial w_0} = \frac{\partial}{\partial w_0} \sum_j \epsilon_j^2 = \frac{\partial}{\partial w_0} \sum_j \left(y_j - (w_1 x_{1j} + w_0)\right)^2 = 0$
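Setting these derivatives to zero gives the usual least-squares normal equations; a minimal NumPy sketch of solving them follows (the data are illustrative assumptions).

```python
import numpy as np

x1 = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

A = np.column_stack([x1, np.ones_like(x1)])   # columns for w1 and w0
w1, w0 = np.linalg.lstsq(A, y, rcond=None)[0] # least-squares solution
print(w1, w0)                                 # slope and intercept minimizing L
```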

16 Model Capacity: Overfitting and Underfitting

17 Model Capacity: Overfitting and Underfitting

18 Model Capacity: Overfitting and Underfitting

19 Model Capacity: Overfitting and Underfitting
[Plot: error on the training set vs. degree of polynomial]

20 Model Capacity: Overfitting and Underfitting
"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." – John von Neumann

21 Training and Testing
[Diagram: data set split into training and test sets]

22 Data Snooping
Do not use the test data for any purpose during training.
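A minimal sketch of holding out a test set up front so it is never touched during training; the split fraction, seed, and synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)

perm = rng.permutation(len(X))
n_test = len(X) // 5                       # 20% held out for final evaluation only
test_idx, train_idx = perm[:n_test], perm[n_test:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]  # do not look at these until the very end
```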

23 Training and Testing
[Plot: training and testing error vs. degree of polynomial]

24 Training and Testing
[Plot: training and testing error vs. degree of polynomial]

25 Regularization
Linear regression:
$\frac{\partial}{\partial w_i} \sum_j \left(y_j - \sum_i w_i f_i(\boldsymbol{x}_j)\right)^2 = 0$
Regularized (L2) linear regression:
$\frac{\partial}{\partial w_i} \left[ \sum_j \left(y_j - \sum_i w_i f_i(\boldsymbol{x}_j)\right)^2 + \lambda \sum_i w_i^2 \right] = 0$
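A minimal sketch of L2-regularized (ridge) linear regression in closed form, $w = (F^\top F + \lambda I)^{-1} F^\top y$, where the columns of $F$ hold the basis functions $f_i(x_j)$; the polynomial basis, noise level, and $\lambda$ value are illustrative assumptions.

```python
import numpy as np

x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)

degree, lam = 9, 1e-3
F = np.vander(x, degree + 1)                  # polynomial features f_i(x_j)
# ridge solution: larger lam shrinks the coefficients toward zero
w = np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ y)
```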

26 Linear Regression - Regularization
[Figure: coefficients of the degree-9 polynomial fit shown for three settings (10, 100, 1000)]

27 Supervised Learning: Classification

28 Supervised Learning: Classification

29 Logistic Regression
Linear regression: $y = w_1 x_1 + w_0 + \epsilon$
Logistic regression: $y = \sigma(w_1 x_1 + w_0 + \epsilon)$, where $\sigma(t) = \frac{1}{1 + e^{-t}}$
[Plot: sigmoid curves for $w_1 = 1$ and $w_1 = 10$]
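A minimal sketch of the sigmoid model; the inputs and weights are illustrative, matching the slide's comparison of $w_1 = 1$ and $w_1 = 10$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

x1 = np.linspace(-5, 5, 11)
w0 = 0.0
for w1 in (1.0, 10.0):
    y = sigmoid(w1 * x1 + w0)   # larger w1 gives a sharper transition at the boundary
    print(w1, np.round(y, 3))
```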

30 Sum of Square Errors as Loss Function
[Plot: error surface over $w_0$ and $w_1$]

31 Sum of Square Errors as Loss Function
[Plot: error surface over $w_0$ and $w_1$]

32 Sum of Square Errors as Loss Function
[Plots: error surface over $w_0$ and $w_1$]

33 Logistic Regression – Loss Function
$L(\boldsymbol{w}) = \log\left( \prod_{i=1}^{n} \sigma(\boldsymbol{x}_i)^{y_i} \left(1 - \sigma(\boldsymbol{x}_i)\right)^{1 - y_i} \right) = \sum_{i=1}^{n} y_i \log \sigma(\boldsymbol{x}_i) + (1 - y_i) \log\left(1 - \sigma(\boldsymbol{x}_i)\right)$
where $\sigma(t) = \frac{1}{1 + e^{-t}}$
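A minimal sketch of this log-likelihood (cross-entropy) loss; here $\sigma(\boldsymbol{x}_i)$ is treated as the model's predicted probability for sample $i$, and the example arrays are illustrative assumptions.

```python
import numpy as np

def log_likelihood(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)        # avoid log(0)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.3, 0.8, 0.9])      # sigma(x_i) for each sample
print(log_likelihood(y, p))             # maximized (loss minimized) during training
```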

34 Logistic Regression – Error Landscape
[Plot: error surface over $w_0$ and $w_1$]

35 Logistic Regression – Error Landscape
[Plot: error surface over $w_0$ and $w_1$]

36 Logistic Regression – Error Landscape
[Plots: error surface over $w_0$ and $w_1$]

37 Training: Gradient Descent

38 Training: Gradient Descent

39 Training: Gradient Descent

40 Training: Gradient Descent

41 Training: Gradient Descent
We want to use a large learning rate when we are far from the minimum and decrease it as we get closer.
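A minimal sketch of gradient descent on the sum-of-squared-errors loss with a decaying learning rate; the schedule (eta0 / (1 + decay * step)) and all constants are illustrative assumptions.

```python
import numpy as np

x1 = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
w1, w0 = 0.0, 0.0
eta0, decay = 0.02, 0.01

for step in range(1000):
    eta = eta0 / (1.0 + decay * step)    # large steps early, smaller steps later
    err = y - (w1 * x1 + w0)
    grad_w1 = -2.0 * np.sum(err * x1)    # dL/dw1
    grad_w0 = -2.0 * np.sum(err)         # dL/dw0
    w1 -= eta * grad_w1
    w0 -= eta * grad_w0

print(w1, w0)
```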

42 Training: Gradient Descent
If the gradient is small in an extended region, gradient descent becomes very slow.

43 Training: Gradient Descent
Gradient descent can get stuck in local minima. To improve the behavior in shallow local minima, we can modify gradient descent to take a running average of the gradient over the last few steps (similar to momentum with friction).
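A minimal sketch of this idea using an exponential moving average of the gradient (momentum); the 1-D loss, its shallow local minimum, and the constants are illustrative assumptions.

```python
import numpy as np

def grad(w):
    # gradient of an illustrative 1-D loss L(w) = w^4 - 3*w^2 + 0.5*w,
    # which has one shallow and one deeper minimum
    return 4 * w ** 3 - 6 * w + 0.5

w, velocity = -2.0, 0.0
eta, beta = 0.01, 0.9                     # beta controls how much past gradients persist

for _ in range(500):
    velocity = beta * velocity + (1 - beta) * grad(w)   # running average of gradients
    w -= eta * velocity
```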

44 Linear Regression – Error Landscape
[Plot: sum of squared errors landscape]

45 Linear Regression – Error Landscape
[Plot: sum of squared errors landscape]

46 Linear Regression – Error Landscape
[Plot: sum of absolute errors landscape]

47 Linear Regression – Error Landscape

48 Gradient Descent

49 Gradient Descent

50 Gradient Descent

51 Gradient Descent

52 Gradient Descent

53 Linear Regression – Gradient Descent

54 Linear Regression – Gradient Descent

55 Linear Regression – Gradient Descent

56 Linear Regression – Gradient Descent

57 Gradient Descent

58 Gradient Descent – Learning Rate
[Plots: learning rate too small vs. too large]

59 Gradient Descent – Learning Rate Decay
[Plots: constant learning rate vs. decaying learning rate]

60 Gradient Descent – Unequal Gradients
[Plots: constant learning rate, decaying learning rate, partially remembering previous gradients]

61 Gradient Descent
[Plots: sum of squared errors vs. sum of absolute errors]

62 Outliers
[Plots: fits with sum of squared errors vs. sum of absolute errors in the presence of outliers]

63 Variable Variance

64 Evaluation of Binary Classification Models
Confusion matrix (rows: actual label, columns: predicted label):
             Predicted 0        Predicted 1
Actual 0     True Negative      False Positive
Actual 1     False Negative     True Positive

True Positive Rate / Sensitivity / Recall = TP/(TP+FN) – fraction of label 1 predicted to be label 1
False Positive Rate = FP/(FP+TN) – fraction of label 0 predicted to be label 1
Accuracy = (TP+TN)/total – fraction of correct predictions
Precision = TP/(TP+FP) – fraction of correct predictions among positive predictions
False Discovery Rate = 1 – Precision
Specificity = TN/(TN+FP) – fraction of correct predictions among label 0
A sketch computing these metrics follows below.
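A minimal sketch of the metrics above computed from binary labels and predictions; the example arrays are illustrative assumptions.

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

sensitivity = tp / (tp + fn)          # true positive rate / recall
fpr         = fp / (fp + tn)          # false positive rate
accuracy    = (tp + tn) / len(y_true)
precision   = tp / (tp + fp)
fdr         = 1 - precision           # false discovery rate
specificity = tn / (tn + fp)
```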

65 Evaluation of Binary Classification Models
[Figure: label 0 and label 1 distributions, with true positives and false positives marked]

66 Evaluation of Binary Classification Models
[Figure: label 0 and label 1 distributions, with true positives and false positives marked]

67 Evaluation of Binary Classification Models: Receiver Operating Characteristic (ROC)
[Plots: ROC curves, true positive rate vs. false positive rate]
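A minimal sketch of tracing an ROC curve by sweeping a decision threshold over predicted scores; the scores and labels are illustrative assumptions.

```python
import numpy as np

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.3, 0.7])

thresholds = np.sort(np.unique(scores))[::-1]
tpr = [np.mean(scores[y_true == 1] >= t) for t in thresholds]   # true positive rate
fpr = [np.mean(scores[y_true == 0] >= t) for t in thresholds]   # false positive rate
# plotting tpr against fpr traces the ROC curve
```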

68 Neural Networks
A single unit computes $f\left(\sum_i w_i x_i + b\right)$ from inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ and bias $b$.
[Diagram: units connected in input, hidden, and output layers]
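A minimal sketch of a single unit and a tiny one-hidden-layer forward pass; the layer sizes, random weights, and choice of activation are illustrative assumptions.

```python
import numpy as np

def unit(x, w, b, f=np.tanh):
    return f(np.dot(w, x) + b)          # f(sum_i w_i x_i + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                               # input
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)        # hidden layer: 4 units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)        # output layer: 1 unit

hidden = np.tanh(W1 @ x + b1)
output = W2 @ hidden + b2                            # network output
```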

69 Generative Adversarial Networks
Nguyen et al., "Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space", CVPR, 2017.

70 Deep Dream
[Images: Google DeepDream rendering of The Garden of Earthly Delights; Hieronymus Bosch, The Garden of Earthly Delights]

71 Artistic Style
L.A. Gatys, A.S. Ecker, M. Bethge, "A Neural Algorithm of Artistic Style", arXiv:1508.06576, 2015.

72 Image Captioning – Combining CNNs and RNNs
Karpathy, A. & Fei-Fei, L., "Deep visual-semantic alignments for generating image descriptions", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

73 Training and Testing
[Diagram: data set split into training and test sets]

74 Validation: Choosing Hyperparameters
[Diagram: data set split into training, validation, and test sets]
Examples of hyperparameters: learning rate schedule, regularization parameter, number of nearest neighbors. A validation-based selection sketch follows below.
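A minimal sketch of choosing a hyperparameter (here, the regularization strength λ from the earlier ridge-regression slide) by its error on a validation set; the splits, polynomial basis, and candidate values are illustrative assumptions, and the test set is assumed to be kept aside elsewhere.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=60)
F = np.vander(x, 10)                                   # degree-9 polynomial features

train, val = np.arange(0, 40), np.arange(40, 60)       # training and validation indices
best = None
for lam in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    # ridge fit on the training split only
    w = np.linalg.solve(F[train].T @ F[train] + lam * np.eye(10), F[train].T @ y[train])
    val_err = np.sum((y[val] - F[val] @ w) ** 2)       # validation error selects lambda
    if best is None or val_err < best[0]:
        best = (val_err, lam, w)
```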

75 Curse of Dimensionality
When the number of dimensions increases, the volume increases and the data becomes sparse. Biomedical data typically has few samples and many measurements.

76 No Free Lunch
Wolpert, David (1996), "The Lack of A Priori Distinctions Between Learning Algorithms", Neural Computation, 8(7).

77 Can we trust the predictions of classifiers?
Ribeiro, Singh and Guestrin, "Why Should I Trust You? Explaining the Predictions of Any Classifier", ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016.

78 Adversarial Fooling Examples
[Images: original correctly classified image, perturbation, and perturbed image classified as ostrich]
Szegedy et al., "Intriguing properties of neural networks", ICLR, 2014.

79 Machine Learning

