MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS


1 MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS
Elena Marchiori, IBIVU, Vrije Universiteit Amsterdam

2 Summary
Machine Learning
Supervised Learning: classification
Unsupervised Learning: clustering

3 Machine Learning (ML)
Construct a computational model from a dataset describing properties of an unknown (but existent) system.
[Diagram: observations of the system's properties are fed to ML, which builds a computational model used for prediction.]

4 Supervised Learning
The dataset describes examples of input-output behaviour of an unknown (but existent) system. The algorithm tries to find a function ‘equivalent’ to the system.
ML techniques for classification: K-nearest neighbour, decision trees, Naïve Bayes, Support Vector Machines.

5 Supervised Learning
[Diagram: a supervisor labels observations from the unknown system with the property of interest, producing training data; an ML algorithm builds a model that predicts the property for a new observation.]

6 Example: A Classification Problem
Categorize images of fish, e.g. “Atlantic salmon” vs. “Pacific salmon”.
Use features such as length, width, lightness, fin shape and number, mouth position, etc.
Steps: preprocessing (e.g., background subtraction), feature extraction, classification.
Example from Duda & Hart

7 Classification in Bioinformatics
Computational diagnostics: early cancer detection
Tumor biomarker discovery
Protein folding prediction
Protein-protein binding site prediction
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

8 Classification Techniques
Naïve Bayes
K Nearest Neighbour
Support Vector Machines (next lesson)
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

9 Bayesian Approach
Each observed training example can incrementally decrease or increase the probability of a hypothesis, rather than eliminating it outright.
Prior knowledge can be combined with observed data to determine a hypothesis.
Bayesian methods can accommodate hypotheses that make probabilistic predictions.
New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
Kathleen McKeown’s slides

10 Bayesian Approach
Assign the most probable target value, given <a1, a2, …, an>:
v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, …, an)
Using Bayes' theorem:
v_MAP = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an) = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj)
Bayesian learning is optimal.
It is easy to estimate P(vj) by counting in the training data.
Estimating the different P(a1, a2, …, an | vj) is not feasible (we would need a training set of size proportional to the number of possible instances times the number of classes).
Kathleen McKeown’s slides

11 Bayes’ Rule
Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)
Bayes' rule: P(a|b) = P(b|a) P(a) / P(b)
In distribution form: P(Y|X) = P(X|Y) P(Y) / P(X) = α P(X|Y) P(Y)
Kathleen McKeown’s slides

12 Naïve Bayes
Assume independence of the attributes:
P(a1, a2, …, an | vj) = ∏_i P(ai | vj)
Substituting into the v_MAP formula:
v_NB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)
Kathleen McKeown’s slides
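
To make the v_NB rule concrete, here is a minimal sketch (not taken from the slides) of Naïve Bayes for categorical attributes; the function names and the toy S-length/S-width/P-length data are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Estimate P(v) and P(a_i | v) by counting in the training data."""
    n = len(y)
    class_counts = Counter(y)
    priors = {v: c / n for v, c in class_counts.items()}
    cond = defaultdict(int)           # cond[(i, a, v)] = count of attribute i = a within class v
    for x, v in zip(X, y):
        for i, a in enumerate(x):
            cond[(i, a, v)] += 1
    likelihood = lambda i, a, v: cond[(i, a, v)] / class_counts[v]
    return priors, likelihood

def predict(x, priors, likelihood):
    """v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    best_v, best_score = None, -1.0
    for v, p in priors.items():
        score = p
        for i, a in enumerate(x):
            score *= likelihood(i, a, v)
        if score > best_score:
            best_v, best_score = v, score
    return best_v

# Toy example with attributes (S-length, S-width, P-length)
X = [("high", "med", "high"), ("low", "low", "low"), ("med", "med", "high")]
y = ["Versicolour", "Setosa", "Virginica"]
priors, likelihood = train_naive_bayes(X, y)
print(predict(("high", "med", "high"), priors, likelihood))   # -> Versicolour
```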

13 v_NB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)
[Example training table with nine instances, attributes S-length, S-width, P-length (values low / med / high) and class labels Setosa, Versicolour, Virginica.]
Kathleen McKeown’s slides

14 Estimating Probabilities
What happens when the number of data elements is small?
Suppose the true P(S-length=low | Virginica) = 0.05, and there are only 2 instances with C = Virginica.
We estimate the probability by nc/n on the training set, so the count of instances with S-length=low and C = Virginica is likely to be 0.
Then, instead of 0.05, we use an estimated probability of 0.
Two problems: a biased underestimate of the probability, and this zero term will dominate the product whenever a future query contains S-length=low.
Kathleen McKeown’s slides

15 Instead: use the m-estimate
Use priors as well: (nc + m·p) / (n + m),
where p is a prior estimate of P(S-length=low | Virginica) and m is a constant called the equivalent sample size, which determines how heavily to weight p relative to the observed data.
Typical method: assume a uniform prior over the attribute's values (e.g. if the values are low, med, high, then p = 1/3).
Kathleen McKeown’s slides
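
A small sketch of the m-estimate above, using the slide's own numbers (nc = 0, n = 2, uniform prior p = 1/3); the value of m is an arbitrary illustrative choice.

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m).

    n_c: count of matching instances (e.g. S-length=low within class Virginica)
    n:   total number of class instances (here 2)
    p:   prior estimate, e.g. 1/3 for a uniform prior over {low, med, high}
    m:   equivalent sample size, weighting the prior against the observed data
    """
    return (n_c + m * p) / (n + m)

# With n_c = 0, n = 2, uniform prior p = 1/3 and m = 3,
# the estimate is (0 + 1) / 5 = 0.2 instead of an exact zero.
print(m_estimate(0, 2, 1/3, 3))
```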

16 K-Nearest Neighbour
Memorize the training data.
Given a new example, find its k nearest neighbours and output the majority-vote class.
Choices: how many neighbours? what distance measure?
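
A minimal k-nearest-neighbour sketch illustrating the two choices (number of neighbours and distance measure); the toy data and the Euclidean default are illustrative assumptions, not from the slides.

```python
import math
from collections import Counter

def knn_predict(x, train_X, train_y, k=3, distance=None):
    """Classify x by majority vote among its k nearest training examples."""
    if distance is None:
        distance = math.dist                     # Euclidean distance by default
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: distance(x, pair[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy usage
train_X = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.2, 4.8)]
train_y = ["A", "A", "B", "B"]
print(knn_predict((1.1, 1.0), train_X, train_y, k=3))   # -> "A"
```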

17 Application in Bioinformatics
A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data, Z. Yao and W.L. Ruzzo, BMC Bioinformatics 2006, 7 For each dataset k, for each pair of genes p compute similarity fk(p) of p wrt k-th data Construct predictor of gene pair similarity, e.g. logistic regression H: f(p,1),…,f(p,m)  H(f(p,1),…,f(p,m)) such that H high value if genes of p have similar functions. Given a new gene g find kNN using H as distance Predict the functional classes C1, .., Cn of g with confidence equal to Confidence(Ci) = 1- Π (1- Pij) with gj neighbour of g and Ci in the set of classes of gj (probability that at least one prediction is correct, that is 1 – probability that all predictions are wrong) prova
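
The confidence-combination step of the scheme above can be sketched as follows; the data layout (a list of per-neighbour class/probability pairs) is an assumption for illustration, and the similarity predictor H itself is not shown.

```python
from collections import defaultdict

def class_confidences(neighbour_predictions):
    """Combine per-neighbour probabilities into Confidence(Ci) = 1 - prod_j (1 - P_ij).

    neighbour_predictions: list of (class_label, probability) pairs, one per
    neighbour gene carrying that class label, where P_ij is the estimated
    probability that the neighbour's annotation transfers to gene g.
    """
    prod_wrong = defaultdict(lambda: 1.0)
    for c, p in neighbour_predictions:
        prod_wrong[c] *= (1.0 - p)          # probability that this prediction is wrong
    return {c: 1.0 - q for c, q in prod_wrong.items()}

# e.g. three neighbours supporting class "transport" with probabilities 0.4, 0.5, 0.3:
# Confidence = 1 - 0.6 * 0.5 * 0.7 = 0.79
print(class_confidences([("transport", 0.4), ("transport", 0.5), ("transport", 0.3)]))
```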

18 Classification: CV error
N samples.
Training (empirical) error: error on the training samples.
Test error: error on an independent test set.
Cross-validation (CV) error: leave-one-out (LOO) or N-fold CV splitting; use 1/N of the samples for testing and the remaining (N−1)/N for training, count the errors over the folds, and summarize them as the CV error rate.
Supervised learning
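
A sketch of leave-one-out cross-validation as described above; train_fn and predict_fn stand in for any classifier and are assumptions, not a specific library API.

```python
def loo_cv_error(X, y, train_fn, predict_fn):
    """Leave-one-out CV: hold out each of the N samples in turn,
    train on the remaining N-1, and count prediction errors.
    X and y are plain Python lists."""
    errors = 0
    for i in range(len(y)):
        X_train = X[:i] + X[i+1:]
        y_train = y[:i] + y[i+1:]
        model = train_fn(X_train, y_train)
        if predict_fn(model, X[i]) != y[i]:
            errors += 1
    return errors / len(y)   # CV error rate
```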

19 Two schemes of cross validation
CV1: on the N samples, run LOO; within each fold, perform gene selection, then train and test the gene selector and the classifier together; count the errors.
CV2: on the N samples, perform gene selection once, then run LOO to train and test only the classifier; count the errors.
Supervised learning

20 Difference between CV1 and CV2
CV1 gene selection within LOOCV CV2 gene selection before before LOOCV CV2 can yield optimistic estimation of classification true error CV2 used in paper by Golub et al. : 0 training error 2 CV error (5.26%) 5 test error (14.7%) CV error different from test error! Supervised learning
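
The two schemes can be contrasted in code; select_genes, train and predict are hypothetical placeholders (select_genes is assumed to return gene indices). The only difference is where the gene selection happens, which is exactly what makes CV2 optimistic.

```python
def cv1_error(X, y, select_genes, train, predict):
    """CV1: gene selection is repeated inside every leave-one-out fold,
    so the held-out sample never influences which genes are chosen."""
    errors = 0
    for i in range(len(y)):
        X_tr = [x for j, x in enumerate(X) if j != i]
        y_tr = [v for j, v in enumerate(y) if j != i]
        genes = select_genes(X_tr, y_tr)                    # selection inside the fold
        model = train([[x[g] for g in genes] for x in X_tr], y_tr)
        if predict(model, [X[i][g] for g in genes]) != y[i]:
            errors += 1
    return errors / len(y)

def cv2_error(X, y, select_genes, train, predict):
    """CV2: genes are selected once on ALL samples before the LOO loop;
    information from the test sample leaks in, so the estimate is optimistic."""
    genes = select_genes(X, y)                              # selection before CV
    X_sel = [[x[g] for g in genes] for x in X]
    errors = 0
    for i in range(len(y)):
        X_tr = [x for j, x in enumerate(X_sel) if j != i]
        y_tr = [v for j, v in enumerate(y) if j != i]
        model = train(X_tr, y_tr)
        if predict(model, X_sel[i]) != y[i]:
            errors += 1
    return errors / len(y)
```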

21 Significance of classification results
Permutation test:
Permute the class labels of the samples.
Compute the LOOCV error on the data with permuted labels.
Repeat this process many times.
Compare with the LOOCV error on the original data:
P-value = (# times LOOCV error on permuted data <= LOOCV error on original data) / total # of permutations considered.
Supervised learning
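
A sketch of the permutation test described above, reusing a LOOCV-error function such as the one sketched earlier; the number of permutations is an arbitrary illustrative choice.

```python
import random

def permutation_p_value(X, y, loo_error_fn, n_permutations=1000, seed=0):
    """P-value = fraction of label permutations whose LOOCV error is
    at most the LOOCV error obtained with the original labels."""
    rng = random.Random(seed)
    observed = loo_error_fn(X, y)
    count = 0
    for _ in range(n_permutations):
        y_perm = y[:]                    # copy, then permute the class labels
        rng.shuffle(y_perm)
        if loo_error_fn(X, y_perm) <= observed:
            count += 1
    return count / n_permutations
```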

22 Unsupervised Learning
ML for unsupervised learning attempts to discover interesting structure in the available data Unsupervised learning

23 Unsupervised Learning
The dataset describes the structure of an unknown (but existent) system. The computer program tries to identify the structure of the system (clustering, data compression).
ML techniques: hierarchical clustering, k-means, Self Organizing Maps (SOM), fuzzy clustering (described in a future lesson).

24 Clustering
Clustering is one of the most important unsupervised learning processes for organizing objects into groups whose members are similar in some way. Clustering finds structure in a collection of unlabeled data. A cluster is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.

25 Clustering Algorithms
Start with a collection of n objects, each represented by a p-dimensional feature vector xi, i = 1, …, n.
The goal is to associate the n objects with k clusters so that objects within a cluster are more similar to each other than to objects in other clusters. k is usually unknown.
Popular methods: hierarchical, k-means, SOM, …
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

26 Hierarchical Clustering
[Figures: Venn diagram of clustered data and the corresponding dendrogram.]

27 Hierarchical Clustering (Cont.)
Multilevel clustering: level 1 has n clusters; level n has one cluster.
Agglomerative HC: starts with singletons and merges clusters.
Divisive HC: starts with one cluster containing all samples and splits clusters.

28 Nearest Neighbor Algorithm
The Nearest Neighbor Algorithm is an agglomerative (bottom-up) approach. It starts with n nodes (n is the number of samples), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.

29 Nearest Neighbor, Level 2, k = 7 clusters.

30 Nearest Neighbor, Level 3, k = 6 clusters.

31 Nearest Neighbor, Level 4, k = 5 clusters.

32 Nearest Neighbor, Level 5, k = 4 clusters.

33 Nearest Neighbor, Level 6, k = 3 clusters.

34 Nearest Neighbor, Level 7, k = 2 clusters.

35 Nearest Neighbor, Level 8, k = 1 cluster.

36 Hierarchical Clustering
Calculate the similarity between all possible pairs of profiles.
Group the two most similar clusters together to form a new cluster.
Calculate the similarity between the new cluster and all remaining clusters, and repeat.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

37 Clustering in Bioinformatics
Microarray data quality checking: do replicates cluster together? Do similar conditions, time points, and tissue types cluster together?
Cluster genes → prediction of the functions of unknown genes from known ones.
Cluster samples → discover clinical characteristics (e.g. survival, marker status) shared by samples.
Promoter analysis of commonly regulated genes.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

38 Functionally significant gene clusters
Two-way clustering: sample clusters and gene clusters.

39 Bhattacharjee et al. (2001) Human lung carcinomas mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA, Vol. 98.

40 Similarity Measurements
Pearson correlation of two profiles (vectors) X and Y:
r(X, Y) = Σ_i (X_i − mean(X)) (Y_i − mean(Y)) / sqrt( Σ_i (X_i − mean(X))² · Σ_i (Y_i − mean(Y))² )
−1 ≤ Pearson correlation ≤ +1
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

41 Similarity Measurements
Pearson Correlation: Trend Similarity From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

42 Similarity Measurements
Euclidean distance between two profiles X and Y:
d(X, Y) = sqrt( Σ_i (X_i − Y_i)² )
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

43 Similarity Measurements
Euclidean Distance: Absolute difference From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
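
A small sketch contrasting the two similarity measures from these slides: Pearson correlation captures trend similarity, while Euclidean distance reflects absolute differences. The toy profiles are illustrative assumptions.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation of two expression profiles; ranges from -1 to +1
    and captures trend similarity (insensitive to scale and offset)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean_distance(x, y):
    """Euclidean distance; sensitive to absolute differences between profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]            # same trend, different scale
print(pearson_correlation(g1, g2))   # 1.0
print(euclidean_distance(g1, g2))    # > 0 despite the identical trend
```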

44 Clustering
[Figure: three clusters C1, C2, C3. Which pair of clusters should be merged?]
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

45 Clustering: Single Linkage
Dissimilarity between two clusters C1 and C2 = minimum dissimilarity between the members of the two clusters.
Tends to generate “long chains”.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

46 Clustering: Complete Linkage
Dissimilarity between two clusters C1 and C2 = maximum dissimilarity between the members of the two clusters.
Tends to generate “clumps”.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

47 Clustering: Average Linkage
Dissimilarity between two clusters C1 and C2 = average of the distances of all pairs of objects (one from each cluster).
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

48 Clustering: Average Group Linkage
Dissimilarity between two clusters C1 and C2 = distance between the two cluster means.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
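
If SciPy is available, the four linkage criteria above map directly onto scipy.cluster.hierarchy; a sketch with a toy expression matrix (the data and the choice of cutting the tree into two clusters are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy expression matrix: rows = genes, columns = conditions.
X = np.array([[1.0, 2.0, 3.0],
              [1.1, 2.1, 2.9],
              [5.0, 5.0, 5.0],
              [5.2, 4.9, 5.1]])

# The four linkage criteria discussed above:
#   single   -> minimum pairwise dissimilarity ("long chains")
#   complete -> maximum pairwise dissimilarity ("clumps")
#   average  -> mean of all pairwise dissimilarities (average linkage)
#   centroid -> distance between cluster means (average group linkage)
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)
```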

49 Considerations
Which genes are used to cluster samples?
Expression variation
Inherent variation
Prior knowledge (irrelevant genes)
Etc.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

50 K-means Clustering
Initialize the K cluster representatives w, e.g. to randomly chosen examples.
Assign each input example x to the cluster c(x) whose representative is nearest: c(x) = argmin_k ||x − w_k||.
Update the weights: move each representative w_k towards the examples currently assigned to cluster k (in batch form, set it to their mean).
Increment n by 1 and repeat until no noticeable changes of the cluster representatives occur.
Unsupervised learning
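
A minimal sketch of the k-means loop described above, using batch updates (each representative is moved to the mean of its assigned examples); the toy data and stopping rule are illustrative assumptions.

```python
import random

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: representatives start at randomly chosen examples,
    each point is assigned to the nearest representative, and each
    representative is moved to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(X, k)]
    for _ in range(n_iter):
        # Assignment step: c(x) = argmin_k ||x - w_k||
        assign = [min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
                  for x in X]
        # Update step: move each representative to the mean of its cluster
        new_centers = []
        for j in range(k):
            members = [x for x, a in zip(X, assign) if a == j]
            new_centers.append([sum(col) / len(members) for col in zip(*members)]
                               if members else centers[j])
        if new_centers == centers:      # no noticeable change -> stop
            break
        centers = new_centers
    return centers, assign

X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.1, 4.9)]
centers, assign = kmeans(X, k=2)
print(assign)
```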

51 Example I Initial Data and Seeds Final Clustering
Unsupervised learning

52 Example II Initial Data and Seeds Final Clustering
Unsupervised learning

53 SOM: Brain’s self-organization
The brain maps the external multidimensional representation of the world into a similar 1- or 2-dimensional internal representation. That is, the brain processes external signals in a topology-preserving way. Mimicking the way the brain learns, our clustering system should be able to do the same thing.
Unsupervised learning

54 Self-Organized Map: idea
Data: vectors X = (X1, …, Xd) from a d-dimensional space.
A grid of nodes, with a local processor (called a neuron) in each node.
Local processor j has d adaptive parameters W(j).
Goal: change the W(j) parameters to recover the data clusters in X space.
Unsupervised learning

55 Training process Java demos: Unsupervised learning

56 Concept of the SOM
[Figure: input space (variables such as Ba, Mn, Sr; samples s1, s2) and reduced feature space. Cluster centers (code vectors) in the input space are placed in the reduced space; the cluster centers are clustered and ordered on a two-dimensional grid.]
Unsupervised learning

57 Concept of the SOM
[Figure: trained SOM component planes for variables such as Ba, Mn, Sr, Mg and sample SA3.]
We can use it for visualization, for classification, and for clustering.
Unsupervised learning

58 SOM: learning algorithm
Initialization: n = 0; choose small random values for the weight-vector components.
Sampling: select an x from the input examples.
Similarity matching: find the winning neuron i(x) at iteration n: i(x) = argmin_j ||x − W(j)(n)||.
Updating: adjust the weight vectors of all neurons using the rule W(j)(n+1) = W(j)(n) + η(n) h_{j,i(x)}(n) (x − W(j)(n)).
Continuation: n = n + 1; go to the Sampling step until no noticeable changes in the weights are observed.
Unsupervised learning

59 Neighborhood Function
Gaussian neighborhood function: h_{j,i}(n) = exp( − d_{ji}² / (2 σ(n)²) ), where d_{ji} is the lateral distance of neurons i and j: |j − i| in a 1-dimensional lattice, ||r_j − r_i|| in a 2-dimensional lattice, with r_j the position of neuron j in the lattice.
Unsupervised learning
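
A minimal SOM sketch combining the learning algorithm of slide 58 with the Gaussian neighbourhood of slide 59; the grid size and the linear decay of the learning rate and neighbourhood width are illustrative assumptions, not taken from the slides.

```python
import math
import random

def train_som(data, grid_w, grid_h, n_iter=2000, lr0=0.5, sigma0=None, seed=0):
    """Minimal SOM on a 2-D lattice with a Gaussian neighbourhood
    h_{j,i}(n) = exp(-d_ji^2 / (2 sigma(n)^2)); the learning rate and the
    neighbourhood width both shrink over time."""
    rng = random.Random(seed)
    d = len(data[0])
    sigma0 = sigma0 or max(grid_w, grid_h) / 2.0
    # One weight vector per lattice node, small random initial values
    nodes = [(ix, iy) for ix in range(grid_w) for iy in range(grid_h)]
    W = {node: [rng.uniform(-0.1, 0.1) for _ in range(d)] for node in nodes}
    for n in range(n_iter):
        x = rng.choice(data)                                   # sampling step
        frac = n / n_iter
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 1e-3
        # Similarity matching: winner i(x) = argmin_j ||x - W(j)||
        winner = min(nodes, key=lambda j: sum((a - b) ** 2 for a, b in zip(x, W[j])))
        # Updating: move every node towards x, weighted by the neighbourhood
        for j in nodes:
            d_ji2 = (j[0] - winner[0]) ** 2 + (j[1] - winner[1]) ** 2
            h = math.exp(-d_ji2 / (2.0 * sigma ** 2))
            W[j] = [w + lr * h * (a - w) for w, a in zip(W[j], x)]
    return W

data = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 1.1)]
W = train_som(data, grid_w=3, grid_h=3)
```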

60 Initial h function (Example )
Unsupervised learning

61 Some examples of real-life applications
The Helsinki University of Technology web site contains > 5000 papers on SOM and its applications:
Brain research: modeling the formation of various topographical maps in motor, auditory, visual and somatotopic areas.
Clustering of genes, protein properties, chemical compounds, speech phonemes, sounds of birds and insects, astronomical objects, economic data, business and financial data, ....
Data compression (images and audio), information filtering.
Medical and technical diagnostics.
Unsupervised learning

62 Issues in Clustering
How many clusters?
A user parameter, or use model selection criteria (e.g. the Bayesian Information Criterion) with a penalization term that accounts for model complexity; see e.g. X-means.
What similarity measure?
Euclidean distance, correlation coefficient, or ad-hoc similarity measures.
Unsupervised learning
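
The slide's suggestion of BIC-based model selection can be sketched with scikit-learn's Gaussian mixture models, whose bic() method penalizes model complexity; this is a stand-in illustration under assumed toy data, not the X-means algorithm itself.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two well-separated groups of "samples"
X = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)),
               rng.normal(4.0, 0.5, size=(30, 2))])

# Fit mixture models with increasing k and pick the one with the lowest BIC,
# which trades goodness of fit against the number of parameters.
bics = {}
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics[k] = gmm.bic(X)
best_k = min(bics, key=bics.get)
print(bics, "-> chosen k =", best_k)
```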

63 Validation of clustering results
External measures: according to some external knowledge; consideration of bias and subjectivity.
Internal measures: quality of the clusters according to the data; compactness and separation; stability.
See e.g. J. Handl, J. Knowles, D.B. Kell, Computational cluster validation in post-genomic data analysis, Bioinformatics 21(15), 2005.
Unsupervised learning

64 Bioinformatics Application
T.R. Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science 286, 531 (1999).
Unsupervised learning

65 Identification of cancer types
Why is identification of the cancer class (tumor sub-type) important? Cancers of identical grade can have widely variable clinical courses (e.g. acute lymphoblastic leukemia, ALL, vs. acute myeloid leukemia, AML).
Traditional methods: morphological appearance, enzyme-based histochemical analyses, immunophenotyping, cytogenetic analysis. These are very accurate, but each is performed in a separate, highly specialized laboratory.
Treatment differs: ALL is treated with corticosteroids, vincristine, methotrexate, and L-asparaginase; AML with daunorubicin and cytarabine. Chemotherapy is highly toxic, so assigning the correct class matters.
Golub et al 1999
Unsupervised learning

66 Class Prediction
How could one use an initial collection of samples belonging to known classes to create a class predictor? Two steps: identification of informative genes, and a weighted vote.
Golub et al slides
Unsupervised learning
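
A hedged sketch of the two steps in the spirit of Golub et al. (signal-to-noise gene weights and a weighted vote); the data layout and the exact normalization are illustrative assumptions rather than a verbatim reproduction of their procedure.

```python
import math

def class_means_sds(expr, labels, gene):
    """Per-class mean and standard deviation of one gene.
    expr: dict sample -> dict gene -> value; labels: dict sample -> 'ALL'/'AML'."""
    stats = {}
    for c in ("ALL", "AML"):
        vals = [expr[s][gene] for s in expr if labels[s] == c]
        mu = sum(vals) / len(vals)
        sd = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
        stats[c] = (mu, sd)
    return stats

def weighted_vote(expr, labels, informative_genes, new_sample):
    """Each informative gene g casts a vote a_g * (x_g - b_g); positive votes
    count for ALL, negative for AML, and the margin gives a prediction strength."""
    v_all = v_aml = 0.0
    for g in informative_genes:
        stats = class_means_sds(expr, labels, g)
        (mu1, sd1), (mu2, sd2) = stats["ALL"], stats["AML"]
        a_g = (mu1 - mu2) / (sd1 + sd2)     # signal-to-noise weight of the gene
        b_g = (mu1 + mu2) / 2.0             # midpoint between the class means
        vote = a_g * (new_sample[g] - b_g)
        if vote > 0:
            v_all += vote
        else:
            v_aml -= vote
    winner = "ALL" if v_all > v_aml else "AML"
    strength = abs(v_all - v_aml) / (v_all + v_aml + 1e-12)   # prediction strength
    return winner, strength
```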

67 Data
Initial sample: 38 bone marrow samples (27 ALL, 11 AML) obtained at the time of diagnosis.
Independent sample: 34 leukemia samples, consisting of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).
Affymetrix arrays, 6817 genes.
Golub et al slides
Unsupervised learning

68 Validation of Gene Voting
Initial samples: 36 of the 38 samples were predicted as either AML or ALL, and two as uncertain; all 36 predictions agree with the clinical diagnosis.
Independent samples: 29 of the 34 samples are strongly predicted, with 100% accuracy.
Golub et al slides
Unsupervised learning

69 Class Discovery Can cancer classes be discovered automatically based on gene expression? Cluster tumors by gene expression Determine whether the putative classes produced are meaningful. Golub et al slides Unsupervised learning

70 Cluster tumors
Self-Organizing Map (SOM): mathematical cluster analysis for recognizing and classifying features in complex, multidimensional data (similar to the K-means approach).
Chooses a geometry of “nodes”; the nodes are mapped into K-dimensional space, initially at random, and are then iteratively adjusted.
Golub et al slides
Unsupervised learning

71 Validation of SOM
Prediction based on clusters A1 and A2:
24/25 of the ALL samples from the initial dataset were clustered in group A1.
10/13 of the AML samples from the initial dataset were clustered in group A2.
Golub et al slides
Unsupervised learning

72 Validation of SOM
How could one evaluate the putative clusters if the “right” answer were not known?
Assumption: class discovery can be tested by class prediction.
Testing the assumption: construct predictors based on clusters A1 and A2, and construct predictors based on random clusters.
Golub et al slides
Unsupervised learning

73 Validation of SOM
Predictions using predictors based on clusters A1 and A2 yield 34 accurate predictions, one error, and three uncertain calls.
Golub et al slides
Unsupervised learning

74 Validation of SOM Golub et al slides Unsupervised learning

75 CONCLUSION
In Machine Learning, every technique has its assumptions and constraints, advantages and limitations.
My view:
First perform simple data analysis before applying fancy high-tech ML methods.
Possibly use different ML techniques and then ensemble the results.
Apply the correct cross-validation method!
Check the significance of the results (permutation test, stability of the selected genes).
Work in collaboration with the data producer (biologist, pathologist) when possible!
ML in bioinformatics

