MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS


1 MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS
Elena Marchiori, IBIVU, Vrije Universiteit Amsterdam

2 Summary
Machine Learning
Supervised Learning: classification
Unsupervised Learning: clustering

3 Machine Learning (ML)
Construct a computational model from a dataset describing properties of an unknown (but existent) system.
[Diagram: observations of the system's properties are fed to ML, which builds a computational model used for prediction.]

4 Supervised Learning
The dataset describes examples of input-output behaviour of an unknown (but existent) system. The algorithm tries to find a function ‘equivalent’ to the system.
ML techniques for classification: K-nearest neighbour, decision trees, Naïve Bayes, Support Vector Machines.

5 Supervised Learning
[Diagram: a supervisor labels observations from the unknown system with the property of interest, producing training data; an ML algorithm builds a model that predicts the property for a new observation.]

6 Example: A Classification Problem
Categorize images of fish, e.g. “Atlantic salmon” vs. “Pacific salmon”.
Use features such as length, width, lightness, fin shape and number, mouth position, etc.
Steps: preprocessing (e.g., background subtraction), feature extraction, classification.
Example from Duda & Hart

7 Classification in Bioinformatics
Computational diagnostics: early cancer detection
Tumor biomarker discovery
Protein folding prediction
Protein-protein binding site prediction
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

8 Classification Techniques
Naïve Bayes
K Nearest Neighbour
Support Vector Machines (next lesson)
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

9 Bayesian Approach
Each observed training example can incrementally decrease or increase the probability of a hypothesis, rather than eliminating it outright.
Prior knowledge can be combined with observed data to determine a hypothesis.
Bayesian methods can accommodate hypotheses that make probabilistic predictions.
New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
Kathleen McKeown’s slides

10 Bayesian Approach
Assign the most probable target value, given <a1, a2, …, an>:
v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, …, an)
Using Bayes' theorem:
v_MAP = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an) = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj)
Bayesian learning is optimal.
It is easy to estimate P(vj) by counting in the training data.
Estimating the different P(a1, a2, …, an | vj) is not feasible (we would need a training set of size proportional to the number of possible instances times the number of classes).
Kathleen McKeown’s slides

11 Bayes’ Rule
Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)
Bayes' rule: P(a|b) = P(b|a) P(a) / P(b)
In distribution form: P(Y|X) = P(X|Y) P(Y) / P(X) = α P(X|Y) P(Y)
Kathleen McKeown’s slides

12 Naïve Bayes
Assume independence of the attributes:
P(a1, a2, …, an | vj) = ∏_i P(ai | vj)
Substituting into the v_MAP formula:
v_NB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)
Kathleen McKeown’s slides
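
To make the v_NB rule concrete, here is a minimal sketch (not taken from the slides) of Naïve Bayes for categorical attributes; the function names and the toy S-length/S-width/P-length data are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Estimate P(v) and P(a_i | v) by counting in the training data."""
    n = len(y)
    class_counts = Counter(y)
    priors = {v: c / n for v, c in class_counts.items()}
    cond = defaultdict(int)           # cond[(i, a, v)] = count of attribute i = a within class v
    for x, v in zip(X, y):
        for i, a in enumerate(x):
            cond[(i, a, v)] += 1
    likelihood = lambda i, a, v: cond[(i, a, v)] / class_counts[v]
    return priors, likelihood

def predict(x, priors, likelihood):
    """v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    best_v, best_score = None, -1.0
    for v, p in priors.items():
        score = p
        for i, a in enumerate(x):
            score *= likelihood(i, a, v)
        if score > best_score:
            best_v, best_score = v, score
    return best_v

# Toy example with attributes (S-length, S-width, P-length)
X = [("high", "med", "high"), ("low", "low", "low"), ("med", "med", "high")]
y = ["Versicolour", "Setosa", "Virginica"]
priors, likelihood = train_naive_bayes(X, y)
print(predict(("high", "med", "high"), priors, likelihood))   # -> Versicolour
```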

13 v_NB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)
[Example training table with nine instances, attributes S-length, S-width, P-length (values low / med / high) and class labels Setosa, Versicolour, Virginica.]
Kathleen McKeown’s slides

14 Estimating Probabilities
What happens when the number of data elements is small?
Suppose the true P(S-length=low | Virginica) = 0.05, and there are only 2 instances with C = Virginica.
We estimate the probability by nc/n on the training set, so the count of instances with S-length=low and C = Virginica is likely to be 0.
Then, instead of 0.05, we use an estimated probability of 0.
Two problems: a biased underestimate of the probability, and this zero term will dominate the product whenever a future query contains S-length=low.
Kathleen McKeown’s slides

15 Instead: use the m-estimate
Use priors as well: (nc + m·p) / (n + m),
where p is a prior estimate of P(S-length=low | Virginica) and m is a constant called the equivalent sample size, which determines how heavily to weight p relative to the observed data.
Typical method: assume a uniform prior over the attribute's values (e.g. if the values are low, med, high, then p = 1/3).
Kathleen McKeown’s slides
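
A small sketch of the m-estimate above, using the slide's own numbers (nc = 0, n = 2, uniform prior p = 1/3); the value of m is an arbitrary illustrative choice.

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m).

    n_c: count of matching instances (e.g. S-length=low within class Virginica)
    n:   total number of class instances (here 2)
    p:   prior estimate, e.g. 1/3 for a uniform prior over {low, med, high}
    m:   equivalent sample size, weighting the prior against the observed data
    """
    return (n_c + m * p) / (n + m)

# With n_c = 0, n = 2, uniform prior p = 1/3 and m = 3,
# the estimate is (0 + 1) / 5 = 0.2 instead of an exact zero.
print(m_estimate(0, 2, 1/3, 3))
```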

16 K-Nearest Neighbour
Memorize the training data.
Given a new example, find its k nearest neighbours and output the majority-vote class.
Choices: how many neighbours? what distance measure?
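
A minimal k-nearest-neighbour sketch illustrating the two choices (number of neighbours and distance measure); the toy data and the Euclidean default are illustrative assumptions, not from the slides.

```python
import math
from collections import Counter

def knn_predict(x, train_X, train_y, k=3, distance=None):
    """Classify x by majority vote among its k nearest training examples."""
    if distance is None:
        distance = math.dist                     # Euclidean distance by default
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: distance(x, pair[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy usage
train_X = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.2, 4.8)]
train_y = ["A", "A", "B", "B"]
print(knn_predict((1.1, 1.0), train_X, train_y, k=3))   # -> "A"
```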

17 Application in Bioinformatics
A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data, Z. Yao and W.L. Ruzzo, BMC Bioinformatics 2006, 7 For each dataset k, for each pair of genes p compute similarity fk(p) of p wrt k-th data Construct predictor of gene pair similarity, e.g. logistic regression H: f(p,1),…,f(p,m)  H(f(p,1),…,f(p,m)) such that H high value if genes of p have similar functions. Given a new gene g find kNN using H as distance Predict the functional classes C1, .., Cn of g with confidence equal to Confidence(Ci) = 1- Π (1- Pij) with gj neighbour of g and Ci in the set of classes of gj (probability that at least one prediction is correct, that is 1 – probability that all predictions are wrong) prova
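
The confidence-combination step of the scheme above can be sketched as follows; the data layout (a list of per-neighbour class/probability pairs) is an assumption for illustration, and the similarity predictor H itself is not shown.

```python
from collections import defaultdict

def class_confidences(neighbour_predictions):
    """Combine per-neighbour probabilities into Confidence(Ci) = 1 - prod_j (1 - P_ij).

    neighbour_predictions: list of (class_label, probability) pairs, one per
    neighbour gene carrying that class label, where P_ij is the estimated
    probability that the neighbour's annotation transfers to gene g.
    """
    prod_wrong = defaultdict(lambda: 1.0)
    for c, p in neighbour_predictions:
        prod_wrong[c] *= (1.0 - p)          # probability that this prediction is wrong
    return {c: 1.0 - q for c, q in prod_wrong.items()}

# e.g. three neighbours supporting class "transport" with probabilities 0.4, 0.5, 0.3:
# Confidence = 1 - 0.6 * 0.5 * 0.7 = 0.79
print(class_confidences([("transport", 0.4), ("transport", 0.5), ("transport", 0.3)]))
```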

18 Classification: CV error
N samples.
Training (empirical) error: error on the training samples.
Test error: error on an independent test set.
Cross-validation (CV) error: leave-one-out (LOO) or N-fold CV splitting; use 1/N of the samples for testing and the remaining (N−1)/N for training, count the errors over the folds, and summarize them as the CV error rate.
Supervised learning
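
A sketch of leave-one-out cross-validation as described above; train_fn and predict_fn stand in for any classifier and are assumptions, not a specific library API.

```python
def loo_cv_error(X, y, train_fn, predict_fn):
    """Leave-one-out CV: hold out each of the N samples in turn,
    train on the remaining N-1, and count prediction errors.
    X and y are plain Python lists."""
    errors = 0
    for i in range(len(y)):
        X_train = X[:i] + X[i+1:]
        y_train = y[:i] + y[i+1:]
        model = train_fn(X_train, y_train)
        if predict_fn(model, X[i]) != y[i]:
            errors += 1
    return errors / len(y)   # CV error rate
```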

19 Two schemes of cross validation
CV1: on the N samples, run LOO; within each fold, perform gene selection, then train and test the gene selector and the classifier together; count the errors.
CV2: on the N samples, perform gene selection once, then run LOO to train and test only the classifier; count the errors.
Supervised learning

20 Difference between CV1 and CV2
CV1 gene selection within LOOCV CV2 gene selection before before LOOCV CV2 can yield optimistic estimation of classification true error CV2 used in paper by Golub et al. : 0 training error 2 CV error (5.26%) 5 test error (14.7%) CV error different from test error! Supervised learning
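
The two schemes can be contrasted in code; select_genes, train and predict are hypothetical placeholders (select_genes is assumed to return gene indices). The only difference is where the gene selection happens, which is exactly what makes CV2 optimistic.

```python
def cv1_error(X, y, select_genes, train, predict):
    """CV1: gene selection is repeated inside every leave-one-out fold,
    so the held-out sample never influences which genes are chosen."""
    errors = 0
    for i in range(len(y)):
        X_tr = [x for j, x in enumerate(X) if j != i]
        y_tr = [v for j, v in enumerate(y) if j != i]
        genes = select_genes(X_tr, y_tr)                    # selection inside the fold
        model = train([[x[g] for g in genes] for x in X_tr], y_tr)
        if predict(model, [X[i][g] for g in genes]) != y[i]:
            errors += 1
    return errors / len(y)

def cv2_error(X, y, select_genes, train, predict):
    """CV2: genes are selected once on ALL samples before the LOO loop;
    information from the test sample leaks in, so the estimate is optimistic."""
    genes = select_genes(X, y)                              # selection before CV
    X_sel = [[x[g] for g in genes] for x in X]
    errors = 0
    for i in range(len(y)):
        X_tr = [x for j, x in enumerate(X_sel) if j != i]
        y_tr = [v for j, v in enumerate(y) if j != i]
        model = train(X_tr, y_tr)
        if predict(model, X_sel[i]) != y[i]:
            errors += 1
    return errors / len(y)
```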

21 Significance of classification results
Permutation test:
Permute the class labels of the samples.
Compute the LOOCV error on the data with permuted labels.
Repeat this process many times.
Compare with the LOOCV error on the original data:
P-value = (# times LOOCV error on permuted data <= LOOCV error on original data) / total # of permutations considered.
Supervised learning
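
A sketch of the permutation test described above, reusing a LOOCV-error function such as the one sketched earlier; the number of permutations is an arbitrary illustrative choice.

```python
import random

def permutation_p_value(X, y, loo_error_fn, n_permutations=1000, seed=0):
    """P-value = fraction of label permutations whose LOOCV error is
    at most the LOOCV error obtained with the original labels."""
    rng = random.Random(seed)
    observed = loo_error_fn(X, y)
    count = 0
    for _ in range(n_permutations):
        y_perm = y[:]                    # copy, then permute the class labels
        rng.shuffle(y_perm)
        if loo_error_fn(X, y_perm) <= observed:
            count += 1
    return count / n_permutations
```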

22 Unsupervised Learning
ML for unsupervised learning attempts to discover interesting structure in the available data Unsupervised learning

23 Unsupervised Learning
The dataset describes the structure of an unknown (but existent) system. The computer program tries to identify the structure of the system (clustering, data compression).
ML techniques: hierarchical clustering, k-means, Self Organizing Maps (SOM), fuzzy clustering (described in a future lesson).

24 Clustering
Clustering is one of the most important unsupervised learning processes for organizing objects into groups whose members are similar in some way. Clustering finds structure in a collection of unlabeled data. A cluster is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.

25 Clustering Algorithms
Start with a collection of n objects, each represented by a p-dimensional feature vector xi, i = 1, …, n.
The goal is to associate the n objects with k clusters so that objects within a cluster are more similar to each other than to objects in other clusters. k is usually unknown.
Popular methods: hierarchical, k-means, SOM, …
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

26 Hierarchical Clustering
[Figures: Venn diagram of clustered data and the corresponding dendrogram.]

27 Hierarchical Clustering (Cont.)
Multilevel clustering: level 1 has n clusters; level n has one cluster.
Agglomerative HC: starts with singletons and merges clusters.
Divisive HC: starts with one cluster containing all samples and splits clusters.

28 Nearest Neighbor Algorithm
The Nearest Neighbor Algorithm is an agglomerative (bottom-up) approach. It starts with n nodes (n is the number of samples), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.

29 Nearest Neighbor, Level 2, k = 7 clusters.

30 Nearest Neighbor, Level 3, k = 6 clusters.

31 Nearest Neighbor, Level 4, k = 5 clusters.

32 Nearest Neighbor, Level 5, k = 4 clusters.

33 Nearest Neighbor, Level 6, k = 3 clusters.

34 Nearest Neighbor, Level 7, k = 2 clusters.

35 Nearest Neighbor, Level 8, k = 1 cluster.

36 Hierarchical Clustering
Calculate the similarity between all possible pairs of profiles.
Group the two most similar clusters together to form a new cluster.
Calculate the similarity between the new cluster and all remaining clusters, and repeat.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

37 Clustering in Bioinformatics
Microarray data quality checking: do replicates cluster together? Do similar conditions, time points, and tissue types cluster together?
Cluster genes → prediction of the functions of unknown genes from known ones.
Cluster samples → discover clinical characteristics (e.g. survival, marker status) shared by samples.
Promoter analysis of commonly regulated genes.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

38 Functionally significant gene clusters
Two-way clustering: sample clusters and gene clusters.

39 Bhattacharjee et al. (2001) Human lung carcinomas mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA, Vol. 98.

40 Similarity Measurements
Pearson correlation of two profiles (vectors) X and Y:
r(X, Y) = Σ_i (X_i − mean(X)) (Y_i − mean(Y)) / sqrt( Σ_i (X_i − mean(X))² · Σ_i (Y_i − mean(Y))² )
−1 ≤ Pearson correlation ≤ +1
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

41 Similarity Measurements
Pearson Correlation: Trend Similarity From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

42 Similarity Measurements
Euclidean distance between two profiles X and Y:
d(X, Y) = sqrt( Σ_i (X_i − Y_i)² )
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

43 Similarity Measurements
Euclidean Distance: Absolute difference From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
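
A small sketch contrasting the two similarity measures from these slides: Pearson correlation captures trend similarity, while Euclidean distance reflects absolute differences. The toy profiles are illustrative assumptions.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation of two expression profiles; ranges from -1 to +1
    and captures trend similarity (insensitive to scale and offset)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean_distance(x, y):
    """Euclidean distance; sensitive to absolute differences between profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]            # same trend, different scale
print(pearson_correlation(g1, g2))   # 1.0
print(euclidean_distance(g1, g2))    # > 0 despite the identical trend
```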

44 Clustering
[Figure: three clusters C1, C2, C3. Which pair of clusters should be merged?]
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

45 Clustering: Single Linkage
Dissimilarity between two clusters C1 and C2 = minimum dissimilarity between the members of the two clusters.
Tends to generate “long chains”.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

46 Clustering: Complete Linkage
Dissimilarity between two clusters C1 and C2 = maximum dissimilarity between the members of the two clusters.
Tends to generate “clumps”.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

47 Clustering: Average Linkage
Dissimilarity between two clusters C1 and C2 = average of the distances of all pairs of objects (one from each cluster).
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

48 Clustering: Average Group Linkage
Dissimilarity between two clusters C1 and C2 = distance between the two cluster means.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong
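
If SciPy is available, the four linkage criteria above map directly onto scipy.cluster.hierarchy; a sketch with a toy expression matrix (the data and the choice of cutting the tree into two clusters are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy expression matrix: rows = genes, columns = conditions.
X = np.array([[1.0, 2.0, 3.0],
              [1.1, 2.1, 2.9],
              [5.0, 5.0, 5.0],
              [5.2, 4.9, 5.1]])

# The four linkage criteria discussed above:
#   single   -> minimum pairwise dissimilarity ("long chains")
#   complete -> maximum pairwise dissimilarity ("clumps")
#   average  -> mean of all pairwise dissimilarities (average linkage)
#   centroid -> distance between cluster means (average group linkage)
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)
```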

49 Considerations
Which genes are used to cluster samples?
Expression variation
Inherent variation
Prior knowledge (irrelevant genes)
Etc.
From: Introduction to Hierarchical Clustering Analysis, Pengyu Hong

50 K-means Clustering
Initialize the K cluster representatives w, e.g. to randomly chosen examples.
Assign each input example x to the cluster c(x) whose representative is nearest: c(x) = argmin_k ||x − w_k||.
Update the weights: move each representative w_k towards the examples currently assigned to cluster k (in batch form, set it to their mean).
Increment n by 1 and repeat until no noticeable changes of the cluster representatives occur.
Unsupervised learning
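
A minimal sketch of the k-means loop described above, using batch updates (each representative is moved to the mean of its assigned examples); the toy data and stopping rule are illustrative assumptions.

```python
import random

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: representatives start at randomly chosen examples,
    each point is assigned to the nearest representative, and each
    representative is moved to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(X, k)]
    for _ in range(n_iter):
        # Assignment step: c(x) = argmin_k ||x - w_k||
        assign = [min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
                  for x in X]
        # Update step: move each representative to the mean of its cluster
        new_centers = []
        for j in range(k):
            members = [x for x, a in zip(X, assign) if a == j]
            new_centers.append([sum(col) / len(members) for col in zip(*members)]
                               if members else centers[j])
        if new_centers == centers:      # no noticeable change -> stop
            break
        centers = new_centers
    return centers, assign

X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.1, 4.9)]
centers, assign = kmeans(X, k=2)
print(assign)
```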

51 Example I Initial Data and Seeds Final Clustering
Unsupervised learning

52 Example II Initial Data and Seeds Final Clustering
Unsupervised learning

53 SOM: Brain’s self-organization
The brain maps the external multidimensional representation of the world into a similar 1- or 2-dimensional internal representation. That is, the brain processes external signals in a topology-preserving way. Mimicking the way the brain learns, our clustering system should be able to do the same thing.
Unsupervised learning

54 Self-Organized Map: idea
Data: vectors X = (X1, …, Xd) from a d-dimensional space.
A grid of nodes, with a local processor (called a neuron) in each node.
Local processor j has d adaptive parameters W(j).
Goal: change the W(j) parameters to recover the data clusters in X space.
Unsupervised learning

55 Training process Java demos: Unsupervised learning

56 Concept of the SOM
[Figure: input space (variables such as Ba, Mn, Sr; samples s1, s2) and reduced feature space. Cluster centers (code vectors) in the input space are placed in the reduced space; the cluster centers are clustered and ordered on a two-dimensional grid.]
Unsupervised learning

57 Concept of the SOM
[Figure: trained SOM component planes for variables such as Ba, Mn, Sr, Mg and sample SA3.]
We can use it for visualization, for classification, and for clustering.
Unsupervised learning

58 SOM: learning algorithm
Initialization: n = 0; choose small random values for the weight-vector components.
Sampling: select an x from the input examples.
Similarity matching: find the winning neuron i(x) at iteration n: i(x) = argmin_j ||x − W(j)(n)||.
Updating: adjust the weight vectors of all neurons using the rule W(j)(n+1) = W(j)(n) + η(n) h_{j,i(x)}(n) (x − W(j)(n)).
Continuation: n = n + 1; go to the Sampling step until no noticeable changes in the weights are observed.
Unsupervised learning

59 Neighborhood Function
Gaussian neighborhood function: h_{j,i}(n) = exp( − d_{ji}² / (2 σ(n)²) ), where d_{ji} is the lateral distance of neurons i and j: |j − i| in a 1-dimensional lattice, ||r_j − r_i|| in a 2-dimensional lattice, with r_j the position of neuron j in the lattice.
Unsupervised learning
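
A minimal SOM sketch combining the learning algorithm of slide 58 with the Gaussian neighbourhood of slide 59; the grid size and the linear decay of the learning rate and neighbourhood width are illustrative assumptions, not taken from the slides.

```python
import math
import random

def train_som(data, grid_w, grid_h, n_iter=2000, lr0=0.5, sigma0=None, seed=0):
    """Minimal SOM on a 2-D lattice with a Gaussian neighbourhood
    h_{j,i}(n) = exp(-d_ji^2 / (2 sigma(n)^2)); the learning rate and the
    neighbourhood width both shrink over time."""
    rng = random.Random(seed)
    d = len(data[0])
    sigma0 = sigma0 or max(grid_w, grid_h) / 2.0
    # One weight vector per lattice node, small random initial values
    nodes = [(ix, iy) for ix in range(grid_w) for iy in range(grid_h)]
    W = {node: [rng.uniform(-0.1, 0.1) for _ in range(d)] for node in nodes}
    for n in range(n_iter):
        x = rng.choice(data)                                   # sampling step
        frac = n / n_iter
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 1e-3
        # Similarity matching: winner i(x) = argmin_j ||x - W(j)||
        winner = min(nodes, key=lambda j: sum((a - b) ** 2 for a, b in zip(x, W[j])))
        # Updating: move every node towards x, weighted by the neighbourhood
        for j in nodes:
            d_ji2 = (j[0] - winner[0]) ** 2 + (j[1] - winner[1]) ** 2
            h = math.exp(-d_ji2 / (2.0 * sigma ** 2))
            W[j] = [w + lr * h * (a - w) for w, a in zip(W[j], x)]
    return W

data = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 1.1)]
W = train_som(data, grid_w=3, grid_h=3)
```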

60 Initial h function (Example )
Unsupervised learning

61 Some examples of real-life applications
The Helsinki University of Technology web site contains > 5000 papers on SOM and its applications:
Brain research: modeling the formation of various topographical maps in motor, auditory, visual and somatotopic areas.
Clustering of genes, protein properties, chemical compounds, speech phonemes, sounds of birds and insects, astronomical objects, economic data, business and financial data, ....
Data compression (images and audio), information filtering.
Medical and technical diagnostics.
Unsupervised learning

62 Issues in Clustering
How many clusters?
A user parameter, or use model selection criteria (e.g. the Bayesian Information Criterion) with a penalization term that accounts for model complexity; see e.g. X-means.
What similarity measure?
Euclidean distance, correlation coefficient, or ad-hoc similarity measures.
Unsupervised learning
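
The slide's suggestion of BIC-based model selection can be sketched with scikit-learn's Gaussian mixture models, whose bic() method penalizes model complexity; this is a stand-in illustration under assumed toy data, not the X-means algorithm itself.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two well-separated groups of "samples"
X = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)),
               rng.normal(4.0, 0.5, size=(30, 2))])

# Fit mixture models with increasing k and pick the one with the lowest BIC,
# which trades goodness of fit against the number of parameters.
bics = {}
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics[k] = gmm.bic(X)
best_k = min(bics, key=bics.get)
print(bics, "-> chosen k =", best_k)
```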

63 Validation of clustering results
External measures: according to some external knowledge; consideration of bias and subjectivity.
Internal measures: quality of the clusters according to the data; compactness and separation; stability.
See e.g. J. Handl, J. Knowles, D.B. Kell, Computational cluster validation in post-genomic data analysis, Bioinformatics 21(15), 2005.
Unsupervised learning

64 Bioinformatics Application
T.R. Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science 286, 531 (1999).
Unsupervised learning

65 Identification of cancer types
Why is identification of the cancer class (tumor sub-type) important? Cancers of identical grade can have widely variable clinical courses (e.g. acute lymphoblastic leukemia, ALL, vs. acute myeloid leukemia, AML).
Traditional methods: morphological appearance, enzyme-based histochemical analyses, immunophenotyping, cytogenetic analysis. These are very accurate, but each is performed in a separate, highly specialized laboratory.
Treatment differs: ALL is treated with corticosteroids, vincristine, methotrexate, and L-asparaginase; AML with daunorubicin and cytarabine. Chemotherapy is highly toxic, so assigning the correct class matters.
Golub et al 1999
Unsupervised learning

66 Class Prediction
How could one use an initial collection of samples belonging to known classes to create a class predictor? Two steps: identification of informative genes, and a weighted vote.
Golub et al slides
Unsupervised learning
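
A hedged sketch of the two steps in the spirit of Golub et al. (signal-to-noise gene weights and a weighted vote); the data layout and the exact normalization are illustrative assumptions rather than a verbatim reproduction of their procedure.

```python
import math

def class_means_sds(expr, labels, gene):
    """Per-class mean and standard deviation of one gene.
    expr: dict sample -> dict gene -> value; labels: dict sample -> 'ALL'/'AML'."""
    stats = {}
    for c in ("ALL", "AML"):
        vals = [expr[s][gene] for s in expr if labels[s] == c]
        mu = sum(vals) / len(vals)
        sd = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
        stats[c] = (mu, sd)
    return stats

def weighted_vote(expr, labels, informative_genes, new_sample):
    """Each informative gene g casts a vote a_g * (x_g - b_g); positive votes
    count for ALL, negative for AML, and the margin gives a prediction strength."""
    v_all = v_aml = 0.0
    for g in informative_genes:
        stats = class_means_sds(expr, labels, g)
        (mu1, sd1), (mu2, sd2) = stats["ALL"], stats["AML"]
        a_g = (mu1 - mu2) / (sd1 + sd2)     # signal-to-noise weight of the gene
        b_g = (mu1 + mu2) / 2.0             # midpoint between the class means
        vote = a_g * (new_sample[g] - b_g)
        if vote > 0:
            v_all += vote
        else:
            v_aml -= vote
    winner = "ALL" if v_all > v_aml else "AML"
    strength = abs(v_all - v_aml) / (v_all + v_aml + 1e-12)   # prediction strength
    return winner, strength
```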

67 Data
Initial sample: 38 bone marrow samples (27 ALL, 11 AML) obtained at the time of diagnosis.
Independent sample: 34 leukemia samples, consisting of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).
Affymetrix arrays, 6817 genes.
Golub et al slides
Unsupervised learning

68 Validation of Gene Voting
Initial samples: 36 of the 38 samples were predicted as either AML or ALL, and two as uncertain; all 36 predictions agree with the clinical diagnosis.
Independent samples: 29 of the 34 samples are strongly predicted, with 100% accuracy.
Golub et al slides
Unsupervised learning

69 Class Discovery Can cancer classes be discovered automatically based on gene expression? Cluster tumors by gene expression Determine whether the putative classes produced are meaningful. Golub et al slides Unsupervised learning

70 Cluster tumors
Self-Organizing Map (SOM): mathematical cluster analysis for recognizing and classifying features in complex, multidimensional data (similar to the K-means approach).
Chooses a geometry of “nodes”; the nodes are mapped into K-dimensional space, initially at random, and are then iteratively adjusted.
Golub et al slides
Unsupervised learning

71 Validation of SOM
Prediction based on clusters A1 and A2:
24/25 of the ALL samples from the initial dataset were clustered in group A1.
10/13 of the AML samples from the initial dataset were clustered in group A2.
Golub et al slides
Unsupervised learning

72 Validation of SOM
How could one evaluate the putative clusters if the “right” answer were not known?
Assumption: class discovery can be tested by class prediction.
Testing the assumption: construct predictors based on clusters A1 and A2, and construct predictors based on random clusters.
Golub et al slides
Unsupervised learning

73 Validation of SOM
Predictions using predictors based on clusters A1 and A2 yield 34 accurate predictions, one error, and three uncertain calls.
Golub et al slides
Unsupervised learning

74 Validation of SOM Golub et al slides Unsupervised learning

75 CONCLUSION
In Machine Learning, every technique has its assumptions and constraints, advantages and limitations.
My view:
First perform simple data analysis before applying fancy high-tech ML methods.
Possibly use different ML techniques and then ensemble the results.
Apply the correct cross-validation method!
Check the significance of the results (permutation test, stability of the selected genes).
Work in collaboration with the data producer (biologist, pathologist) when possible!
ML in bioinformatics

