Lesson 8: Machine Learning (and the Legionella as a case study) Biological Sequences Analysis, MTA.


1 Lesson 8: Machine Learning (and the Legionella as a case study) Biological Sequences Analysis, MTA

2 Introduction to Machine Learning

3 Biological Sequences Analysis, MTA 3 of 39 Some cool examples Introduction

4 Types of learning
Supervised learning: uses "labeled" examples of input and desired output.
Unsupervised learning: models a set of inputs; labeled examples are not available.
Reinforcement learning: uses feedback on actions, obtained by observing the environment (maximizing long-term reward).

5 Clustering

6 Clustering definition
Input: a set of instances.
Output: subsets (called clusters) such that instances in the same cluster are similar.
Is it supervised or not? What does "similar" mean?

7 K-means clustering
0. Choose the number of clusters (k)
1. Initiation: randomly generate k centers
2. Assignment of each point to the nearest cluster center

8 K-means clustering
0. Choose the number of clusters (k)
1. Initiation: randomly generate k centers
2. Assignment of each point to the nearest cluster center
3. Update the location of the centers

9 K-means clustering
0. Choose the number of clusters (k)
1. Initiation: randomly generate k centers
2. Assignment of each point to the nearest cluster center
3. Update the location of the centers
4. Repeat 2-3 until no further change
K-means: interactive demo
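
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lesson: the function names, the toy points, and the choice to draw the initial centers from the data points themselves are all assumptions (the slide only says "randomly generate k centers").

```python
import random

def sq_dist(p, c):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of (x, y) tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: initiation. Assumption: centers start at k randomly chosen points.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        # Step 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return centers, labels

# Hypothetical toy data: two visually separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

The result depends on the random initiation, and k-means can settle in a poor local optimum, so in practice it is usually run several times with different seeds.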

10 Other clustering algorithms
Take into account:
homogeneity: similarity of instances inside a cluster.
separation: dissimilarity of instances in different clusters.
Allow "fuzzy clustering": an instance can belong to more than one cluster.
Hierarchical clustering

11 Hierarchical clustering
[Figure: a raw table (instances 1-5, columns C1, C2, C3, C4, C5, C6, ..) is turned by a similarity criterion into a similarity matrix of scores over instances 1-5, and a cluster criterion turns that matrix into a hierarchical clustering]

12 Hierarchical clustering
Cluster criteria: UPGMA (you should already know it…), Neighbor-joining
[Figure: the same pipeline (raw table, similarity criterion, scores, cluster criterion), here with instances A, B, C, D, E merged step by step, e.g. (C,B) and then ((C,B),E)]
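
As a reminder of how UPGMA builds a tree from a distance matrix, here is a minimal sketch of average-linkage agglomerative clustering. The distance values are invented for illustration, and the sketch returns only the merge order (the tree topology); a full UPGMA implementation would also compute branch lengths.

```python
from itertools import combinations

def average_distance(leaves_a, leaves_b, dist):
    """UPGMA cluster distance: mean of all leaf-to-leaf distances."""
    total = sum(dist[frozenset((a, b))] for a in leaves_a for b in leaves_b)
    return total / (len(leaves_a) * len(leaves_b))

def upgma(names, dist):
    """Repeatedly merge the two closest clusters; return a nested tuple."""
    clusters = [(name, [name]) for name in names]  # (tree, list of leaves)
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average_distance(clusters[ij[0]][1],
                                                   clusters[ij[1]][1], dist))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical distances: B and C are closest, E joins them next, then A and D.
names = ["A", "B", "C", "D", "E"]
close_pairs = {("B", "C"): 2, ("B", "E"): 4, ("C", "E"): 4, ("A", "D"): 5}
dist = {frozenset(pair): d for pair, d in close_pairs.items()}
for a, b in combinations(names, 2):
    dist.setdefault(frozenset((a, b)), 8)  # every other pair is far apart

tree = upgma(names, dist)
print(tree)  # (('E', ('B', 'C')), ('A', 'D'))
```

Cutting such a tree at a chosen height gives a flat set of clusters, which is why a tree can be read as a clustering.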

13 Hierarchical clustering
Wait a minute… A tree is clustering?!

14 Classifying

15 What is classification?
Input: a labeled training set and an unlabeled data set.
Learn to classify (assign labels) according to the features of the training set.
Output: labels on the data set.
Example: qualified boy/girlfriend

16-20 Where to draw the line?!?!
[Figure, shown on five consecutive slides: scatter plot with the X axis from 0 to 6 and the Y axis from 0 to 100; points are labeled Unqualified or Qualified]
Slide 20 adds: Now consider dozens of features…

21 How to classify
KNN (K Nearest Neighbors)
Decision trees
SVM (Support Vector Machine)
Naïve Bayes
Bayesian Networks
NN (Neural Networks)
Many, many more…

22 KNN (K Nearest Neighbors)
[Figure: the same scatter plot, X axis from 0 to 6 and Y axis from 0 to 100]
Lazy (no pre-processing)
Local
Can deal with complex patterns
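
A minimal KNN sketch, assuming two numeric features and invented training points. "Lazy" shows up in the code as the absence of any training step: all the work happens when a query arrives, and only the k closest training points ("local") influence the answer.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label); query: (x, y). Returns the majority label."""
    # Sort training points by Euclidean distance to the query, keep the k nearest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set on the same scales as the plot (X: 0-6, Y: 0-100).
train = [
    ((1.0, 20), "Unqualified"), ((1.5, 30), "Unqualified"), ((2.0, 25), "Unqualified"),
    ((4.0, 70), "Qualified"), ((4.5, 80), "Qualified"), ((5.0, 75), "Qualified"),
]
print(knn_predict(train, (4.2, 72)))
print(knn_predict(train, (1.2, 22)))
```

Because Y spans a much larger range than X here, Y dominates the distance; features are commonly rescaled before running KNN.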

23 Decision trees
[Figure: the same scatter plot, partitioned by a tree whose nodes test X < 1.7 versus X ≥ 1.7 and Y < 36 versus Y ≥ 36; two leaves are marked "?"]
Tree actually means something!
Can deal with complex patterns
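
A decision tree "means something" because it reads as a set of plain rules. The sketch below hard-codes the two thresholds from the slide (1.7 on X, 36 on Y); the order of the tests and the labels at the leaves are assumptions, since the transcript shows the leaves only as "?".

```python
def classify(x, y):
    """Hand-written two-level decision tree using the slide's thresholds."""
    if x < 1.7:          # first split, on X
        return "Unqualified"   # assumed leaf label
    if y < 36:           # second split, on Y, reached only when X >= 1.7
        return "Unqualified"   # assumed leaf label
    return "Qualified"         # assumed leaf label

for point in [(1.0, 90), (3.0, 20), (3.0, 50)]:
    print(point, classify(*point))
```

A learning algorithm would choose the split features and thresholds from the training set rather than have them written by hand.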

24 SVM (Support Vector Machine)

25 SVM (Support Vector Machine)
Finds the optimal linear separation.
Maximizes the margin between the two data sets.
Can use a transformation to a higher dimension when the data are not linearly separable.
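
For reference, "maximizing the margin" is usually written as the following optimization problem. This is the standard textbook hard-margin formulation, not something shown in the slides; the symbols are assumptions: x_i is the feature vector of training example i, y_i ∈ {-1, +1} is its label, and w, b define the separating hyperplane w · x + b = 0.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ \ge\ 1 \quad \text{for all } i,
\qquad \text{margin width} = \frac{2}{\lVert w \rVert}
```

Minimizing the norm of w is the same as maximizing the margin width. When the data are not linearly separable, each x_i is replaced by a transformed vector φ(x_i) in a higher-dimensional space and the same problem is solved there.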

26 Naïve Bayes
[The class symbols were images in the original slides; "class" stands for either of the two classes]
Can easily compute P(class | X) for each of the two classes, and can do the same for Y: P(class | Y).
Score(class) = P(class | X,Y)
Score(class) = P(class | X) · P(class | Y)

27 Naïve Bayes: graphical representation
[Figure: features X, Y and Z, each linked to the class separately through P(class | X), P(class | Y) and P(class | Z)]
Score(class) = P(class | X,Y,Z) = P(class | X) · P(class | Y) · P(class | Z)
What if there are dependencies??
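
The per-feature score can be sketched as follows, assuming discrete feature values and estimating each P(class | feature = value) by counting in a small invented training set. The code follows the product as written on the slides; the more common textbook form of naïve Bayes multiplies the class prior by P(feature | class) for each feature instead.

```python
from collections import Counter

def train(rows):
    """rows: list of (features, label). Count co-occurrences per feature position."""
    joint = Counter()     # (position, value, label) -> count
    marginal = Counter()  # (position, value) -> count
    for features, label in rows:
        for i, value in enumerate(features):
            joint[(i, value, label)] += 1
            marginal[(i, value)] += 1
    return joint, marginal

def score(model, features, label):
    """Product over features of the estimated P(label | feature_i = value)."""
    joint, marginal = model
    result = 1.0
    for i, value in enumerate(features):
        # Raises ZeroDivisionError for a value never seen in training;
        # a real implementation would smooth the counts.
        result *= joint[(i, value, label)] / marginal[(i, value)]
    return result

# Hypothetical training set with two discretized features (X, Y).
rows = [
    (("high", "high"), "Qualified"),
    (("high", "low"), "Qualified"),
    (("high", "high"), "Qualified"),
    (("low", "high"), "Unqualified"),
    (("low", "low"), "Unqualified"),
    (("high", "low"), "Unqualified"),
]
model = train(rows)
query = ("high", "high")
print(score(model, query, "Qualified"), score(model, query, "Unqualified"))
```

For this query the Qualified score is 3/4 · 2/3 and the Unqualified score is 1/4 · 1/3, so the instance is labeled Qualified.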

28 Bayesian Network
[Figure: features X, Y and Z, annotated with P(class | X,Z), P(class | Y) and P(X | Z)]
Score(class) = P(class | X,Y,Z) = P(class | X,Z) · P(class | Y)
A Bayesian Network takes dependencies into account.

29 How to choose a classifier (estimate performance)?
Use a labeled test set (in addition to the training set).
Cross-validation:
10-fold
Leave-one-out
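
A minimal sketch of k-fold cross-validation, with leave-one-out as the special case k = number of examples. The function names and the trivial majority-class "classifier" used for the demonstration are assumptions made for illustration.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k consecutive folds over n items."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cross_validate(items, labels, k, fit, predict):
    """Train on k-1 folds, test on the held-out fold, return overall accuracy."""
    correct = 0
    for train, test in k_fold_indices(len(items), k):
        model = fit([items[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, items[i]) == labels[i] for i in test)
    return correct / len(items)

# Hypothetical demonstration: a "classifier" that always predicts the
# most frequent label of its training set.
def fit(xs, ys):
    return Counter(ys).most_common(1)[0][0]

def predict(model, x):
    return model

items = list(range(10))
labels = ["a"] * 8 + ["b"] * 2
accuracy = cross_validate(items, labels, k=5, fit=fit, predict=predict)
leave_one_out = cross_validate(items, labels, k=len(items), fit=fit, predict=predict)
print(accuracy, leave_one_out)
```

The folds here are taken in order; in practice the examples are usually shuffled first so that each fold has a similar mix of labels.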

30 Legionella pneumophila case study

31 How did it all begin?

32 Legionnaires' disease nowadays

33 Legionella pneumophila
[Figure] Copyright © 2005 Nature Publishing Group. Created by Arkitek from Nature Reviews Microbiology

34 Identifying the effectors

35 The features
Homology to host proteins
Regulatory elements
Genome proximity to other effectors
Secretion signal
Abundance in Metazoa / Bacteria
GC content
Sequence homology
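
Each feature has to be turned into a number per gene before a classifier can use it. GC content is the simplest of the listed features to compute; the sketch below assumes a plain DNA string and uses a made-up sequence, and it is not the pipeline used in the study.

```python
def gc_content(sequence):
    """Fraction of G and C nucleotides in a DNA sequence (0.0 for an empty one)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

# Hypothetical gene fragment.
print(gc_content("ATGCGCTA"))
```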

36 The effectors machine

37 The big picture
[Flow diagram; the labels are grouped below by stage]
Prior knowledge: validated effectors, non-effectors
Features: similarity to known effectors, similarity to host proteins, regulatory elements, G-C content, secretory signals, abundance in Metazoa\Bacteria, genome arrangement
Feature selection
Classification algorithms: NN, SVM, Naïve Bayes, Bayesian Net, Voting
Trained model
Unclassified genes
Predicted effectors, predicted non-effectors
Experimental validation
Newly validated effectors
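
The "Voting" box combines the answers of the individual classifiers. The slides do not say which voting rule was used; the sketch below assumes a simple majority vote, and the classifier outputs are invented.

```python
from collections import Counter

def vote(predictions):
    """Return the label predicted by the most classifiers.
    Ties go to the label that appears first in the input."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions for one unclassified gene.
predictions = {
    "NN": "effector",
    "SVM": "effector",
    "Naive Bayes": "non-effector",
    "Bayesian Net": "effector",
}
decision = vote(predictions.values())
print(decision)
```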

38 Does it really work??

39 Biological Sequences Analysis, MTA 39 of 39

