Lecture 4: Cluster analysis


Species   Sequence
P.symA    AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTTCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG
P.xanA    AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTAATATTCCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG
P.polaA   AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTCCGTATGCTATGTAGCTGGAGGGTACTGACGGTAG
C.platA   AATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTAAGGGTACTGATTTTAG
C.gradA   AATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTCCGGGTACTGATTTTAG
D.symT    TATGCGAGACGTGAAAAATCTTTAGGGCTAAGGTGATTATTTCGGTTGCTATGTAGAGGAAGGGTACTGACGGTAG

A cluster analysis is a two-step process that requires the choice of (a) a distance metric and (b) a linkage algorithm.

Cluster analysis tries to minimize within-cluster distances and to maximize between-cluster distances.

The distance metric

From the aligned sequences above, a pairwise distance matrix between P.sym, P.xan, P.pola, C.plat, C.grad and D.sym can be built. In the simplest case such a distance matrix counts the number of differences between two data sets.
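A minimal sketch of this idea (not part of the original slides; the helper name hamming and the restriction to two of the six sequences are my own choices):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length aligned sequences differ."""
    assert len(a) == len(b), "sequences must be aligned to the same length"
    return sum(x != y for x, y in zip(a, b))

# Two of the aligned sequences from the table above; extend the dictionary to
# all six species to obtain the full 6 x 6 distance matrix.
seqs = {
    "P.sym": "AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTTCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG",
    "P.xan": "AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTAATATTCCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG",
}
names = list(seqs)
dist = [[hamming(seqs[a], seqs[b]) for b in names] for a in names]
print(names)
print(dist)  # symmetric matrix of pairwise difference counts
```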

Species presence-absence matrix A

Species   Site 1  Site 2  Site 3  Site 4
P.sym        1       0       1       1
P.xan        1       0       0       1
P.pola       0       1       0       1
C.plat       0       1       1       1
C.grad       1       0       0       0
D.sym        1       0       1       1
Sum          4       2       3       5

The matrix D = A^T A gives, for each pair of sites, the number of shared species; its diagonal gives the number of species per site. From the shared and unshared species counts two site-by-site similarity matrices can be derived: the Soerensen index 2a / (2a + b + c) and the Jaccard index a / (a + b + c), where a is the number of species present at both sites and b and c are the numbers of species present at only one of the two sites.
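A minimal sketch of these calculations (assumed, not from the slides; NumPy is used for the matrix product):

```python
import numpy as np

# Species-by-site presence-absence matrix A from the table above (rows: P.sym,
# P.xan, P.pola, C.plat, C.grad, D.sym; columns: sites 1-4).
A = np.array([
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 1],
])

D = A.T @ A          # shared-species counts; diagonal = species per site
print(D)

# Pairwise Soerensen and Jaccard indices between sites.
n_sites = A.shape[1]
for i in range(n_sites):
    for j in range(i + 1, n_sites):
        a = D[i, j]              # species shared by both sites
        b = D[i, i] - a          # species found only at site i
        c = D[j, j] - a          # species found only at site j
        soerensen = 2 * a / (2 * a + b + c)
        jaccard = a / (a + b + c)
        print(f"sites {i + 1} and {j + 1}: Soerensen={soerensen:.2f}, Jaccard={jaccard:.2f}")
```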

Abundance data

Abundance matrix of the six species (P.sym, P.xan, P.pola, C.plat, C.grad, D.sym) at sites 1-4, with column sums, and the site-by-site correlation distance matrix derived from it.

Euclidean distance: due to the squaring, Euclidean distances put particular weight on outliers; they need a linear scale.
Manhattan distance: also needs linear scales.
Correlation distance: can be zero despite a large absolute distance between the profiles; correlations are sensitive to non-linearities in the data.
Bray-Curtis distance: equivalent to the Soerensen index for presence-absence data; it suffers from the same shortcoming as the Manhattan distance.
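The four metrics are available in SciPy; a minimal sketch with made-up abundance vectors for two sites (the numerical values of the slide's table did not survive in this transcript):

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, correlation, braycurtis

# Hypothetical abundances of six species at two sites (illustrative values only).
x = np.array([12.0, 0.0, 3.0, 7.0, 1.0, 5.0])
y = np.array([10.0, 2.0, 0.0, 9.0, 1.0, 4.0])

print("Euclidean  :", euclidean(x, y))    # sqrt(sum (x_i - y_i)^2); emphasises outliers
print("Manhattan  :", cityblock(x, y))    # sum |x_i - y_i|
print("Correlation:", correlation(x, y))  # 1 - Pearson r; ignores absolute levels
print("Bray-Curtis:", braycurtis(x, y))   # sum |x_i - y_i| / sum (x_i + y_i)
```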

Linkage algorithm

Starting from the matrix of pairwise distances between P.sym, P.xan, P.pola, C.plat, C.grad and D.sym, we first combine the species that are nearest to each other to form an initial cluster. In the next step we look for the species or cluster that is closest to the initial cluster (for instance in terms of its average distance). We continue this procedure until all species are grouped. The single linkage algorithm tends to produce many small clusters.
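A minimal, hand-rolled sketch of this agglomerative procedure (assumed, not the slide's own code; the function single_linkage and the three-item distance values are illustrative):

```python
import itertools

def single_linkage(dist, labels):
    """Repeatedly merge the two closest clusters; the distance between two
    clusters is the smallest member-to-member distance (single linkage)."""
    clusters = [frozenset([label]) for label in labels]
    while len(clusters) > 1:
        pair = min(itertools.combinations(clusters, 2),
                   key=lambda p: min(dist[frozenset([a, b])]
                                     for a in p[0] for b in p[1]))
        print("merge", set(pair[0]), "+", set(pair[1]))
        clusters = [c for c in clusters if c not in pair] + [pair[0] | pair[1]]

# Illustrative pairwise distances for three of the species.
labels = ["P.sym", "P.xan", "D.sym"]
dist = {frozenset(["P.sym", "P.xan"]): 2,
        frozenset(["P.sym", "D.sym"]): 9,
        frozenset(["P.xan", "D.sym"]): 10}
single_linkage(dist, labels)   # merges P.sym with P.xan first, then adds D.sym
```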

Sequential versus simultaneous algorithms: in simultaneous algorithms the final solution is obtained in a single step, not stepwise as in the single linkage above.

Agglomerative versus divisive algorithms: agglomerative procedures operate bottom up, divisive procedures top down.

Monothetic versus polythetic algorithms: polythetic procedures use several descriptors of linkage, monothetic procedures use the same one at each step (for instance maximum association).

Hierarchical versus non-hierarchical algorithms: hierarchical methods proceed in a non-overlapping way; during the linkage process all members of lower clusters are members of the next higher cluster. Non-hierarchical methods proceed by optimizing within-group homogeneity and might therefore include members not contained in a higher-order cluster.

The single linkage algorithm uses the minimum distance between the members of two clusters as the measure of cluster distance; it favours chains of small clusters. Average linkage uses average distances between clusters and frequently gives larger clusters; the most often used average linkage algorithm is the Unweighted Pair-Group Method Average (UPGMA). The Ward algorithm calculates the total sum of squared deviations from the mean of a cluster and assigns members so as to minimize this sum; it often gives clusters of rather equal size. Median clustering tries to minimize within-cluster variance.
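A minimal sketch comparing three of these linkage rules with SciPy (assumed, not from the slides; the small two-group data set is made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# A small made-up data set: one tight group, a second group, and one point in between.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [5.0, 5.2], [5.1, 4.9],
              [2.6, 2.4]])
d = pdist(X)  # condensed matrix of Euclidean distances

# "single" = minimum distance, "average" = UPGMA, "ward" = minimum within-cluster
# sum of squares (Ward linkage assumes Euclidean distances).
for method in ("single", "average", "ward"):
    Z = linkage(d, method=method)
    print(method, fcluster(Z, t=2, criterion="maxclust"))  # cut the tree into 2 clusters
```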

Which clusters to accept? To check the performance of different cluster algorithms and distance metrics we use a matrix of random numbers.

Different cluster algorithms give different results. We accept those clusters that are stable irrespective of the algorithm. In the case of our random numbers the clustering is very unstable.
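A minimal sketch of such a stability check (assumed, not from the slides; the cluster count of four and the adjusted Rand index as agreement measure are my own choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)
X = rng.random((20, 5))   # a matrix of random numbers: 20 items, 5 variables
d = pdist(X)

# Partition the same random data with four different linkage rules.
parts = {m: fcluster(linkage(d, method=m), t=4, criterion="maxclust")
         for m in ("single", "complete", "average", "ward")}

# Pairwise agreement between the partitions; for pure noise it is typically low.
for a in parts:
    for b in parts:
        if a < b:
            print(a, "vs", b, round(adjusted_rand_score(parts[a], parts[b]), 2))
```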

Two methods detected the clusters OP and ABC. All other items are not clearly separated. The position of item F remains unclear.

Clustering using a predefined number of clusters: K-means

K-means clustering starts from a predefined number of clusters and then arranges the items so that the distances between clusters are maximized relative to the distances within clusters. Technically, the algorithm first randomly assigns cluster means and then reassigns items, each time recalculating the cluster means, until an optimal solution (convergence) has been reached. K-means always uses Euclidean distances.
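A minimal sketch with scikit-learn (assumed, not the slide's own example; the two-group toy data are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up groups of points in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(10, 2)),
               rng.normal(loc=3.0, scale=0.3, size=(10, 2))])

# k is fixed in advance; the algorithm iterates assignment and mean updates
# until convergence, using Euclidean distances.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment of each item
print(km.cluster_centers_)   # the final cluster means
```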

Neighbour joining

Neighbour joining is particularly used to generate phylogenetic trees. It needs dissimilarities (phylogenetic distances) d(X, Y) between all pairs of elements X and Y.

For all pairs (i, j) of the n elements, calculate Q(i, j) = (n - 2) d(i, j) - Σ_k d(i, k) - Σ_k d(j, k).
Select the pair (i, j) with the lowest value of Q and join it into a new node u.
Calculate the distances from the new node: d(i, u) = d(i, j) / 2 + [Σ_k d(i, k) - Σ_k d(j, k)] / (2(n - 2)) and d(j, u) = d(i, j) - d(i, u).
Calculate new dissimilarities to the remaining elements: d(u, k) = [d(i, k) + d(j, k) - d(i, j)] / 2.
Repeat until the tree is complete.
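A minimal sketch of one neighbour-joining step (assumed, not the slide's own code; the function name nj_step and the four-element example matrix are illustrative):

```python
import numpy as np

def nj_step(d):
    """One neighbour-joining step on a symmetric dissimilarity matrix d."""
    n = d.shape[0]
    row_sums = d.sum(axis=1)
    # Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
    Q = (n - 2) * d - row_sums[:, None] - row_sums[None, :]
    np.fill_diagonal(Q, np.inf)                 # never join an element with itself
    i, j = np.unravel_index(np.argmin(Q), Q.shape)
    # Branch lengths from the joined elements i and j to the new node u.
    d_iu = 0.5 * d[i, j] + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    d_ju = d[i, j] - d_iu
    # Dissimilarity of every element k to the new node u (entries for i, j unused).
    d_ku = 0.5 * (d[i, :] + d[j, :] - d[i, j])
    return (i, j), d_iu, d_ju, d_ku

# Illustrative dissimilarity matrix for four elements.
d = np.array([[0.0,  5.0,  9.0,  9.0],
              [5.0,  0.0, 10.0, 10.0],
              [9.0, 10.0,  0.0,  8.0],
              [9.0, 10.0,  8.0,  0.0]])
pair, d_iu, d_ju, d_ku = nj_step(d)
print("join", pair, "branch lengths", d_iu, d_ju)
```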