Dendrograms for Data Mining

Eamonn Keogh

What is Clustering? Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing. Informally: finding natural groupings or relationships among objects.

What is a natural grouping among these objects?

What is a natural grouping among these objects? Clustering is subjective. [Figure: the same objects grouped two ways: Simpson's Family vs. School Employees, and Females vs. Males.]

Two Types of Clustering
Partitional algorithms: construct various partitions and then evaluate them by some criterion.
Hierarchical algorithms: create a hierarchical decomposition of the set of objects using some criterion.
[Figure panels: Hierarchical, Partitional.]

What is Similarity? Webster's Dictionary: “The quality or state of being similar; likeness; resemblance; as, a similarity of features.” Similarity is hard to define, but “we know it when we see it”. The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.

Defining Distance Measures. Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1,O2). [Figure: the pair “Peter” and “Piotr”, with example distance values 0.23, 3, and 342.7.]
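One concrete choice of D for strings is the string edit distance that these slides use for the name-clustering demonstration. The following is a minimal Python sketch, assuming the standard Levenshtein definition (unit cost for each insertion, deletion, or substitution). The function name edit_distance is illustrative and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Peter", "Piotr"))  # 3
```

Under this definition D("Peter", "Piotr") is 3: substitute i for e, insert o, delete the second e.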

A Useful Tool for Summarizing Similarity Measurements. Introducing the dendrogram (related terms: cladogram, phylogenetic tree, phylogram). The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share: the lower that node, the more similar the two objects.
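To make the "lowest shared internal node" rule concrete, here is a small Python sketch. The tree encoding (a leaf is a string, an internal node is a tuple of height, left child, right child), the example heights, and the function names are all assumptions made for illustration. They are not taken from the slides.

```python
def leaves(node):
    """Set of leaf labels under a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def shared_height(node, a, b):
    """Height of the lowest internal node whose subtree contains both
    leaves a and b (a != b), or None if they are not both under node."""
    if isinstance(node, str) or not {a, b} <= leaves(node):
        return None
    height, left, right = node
    for child in (left, right):
        if {a, b} <= leaves(child):
            return shared_height(child, a, b)  # a lower shared node exists
    return height  # a and b split here, so this is the lowest shared node


# Hypothetical dendrogram with made-up heights.
tree = (3.0, (1.0, "Piotr", "Pyotr"), (2.0, "Peter", "Pedro"))

print(shared_height(tree, "Piotr", "Pyotr"))  # 1.0
print(shared_height(tree, "Piotr", "Peter"))  # 3.0
```

In this toy tree, Piotr and Pyotr join at height 1.0 and so read as more similar than Piotr and Peter, which only join at the root.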

There is only one dataset that can be perfectly clustered using a hierarchy… (Bovine:0.69395, (Spider Monkey:0.390, (Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939);

Note that hierarchies are commonly used to organize information, for example in a web portal. Yahoo’s hierarchy is manually created; we will focus on automatic creation of hierarchies in data mining. [Figure: part of the Yahoo directory. Top level: Business & Economy. Second level: B2B, Finance, Shopping, Jobs. Third level, in the same order: Aerospace, Agriculture…; Banking, Bonds…; Animals, Apparel; Career, Workspace.]

A Demonstration of Hierarchical Clustering using String Edit Distance
Pedro (Portuguese): Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian Alternative), Petr (Czech), Pyotr (Russian)
Cristovao (Portuguese): Christoph (German), Christophe (French), Cristobal (Spanish), Cristoforo (Italian), Kristoffer (Scandinavian), Krystof (Czech), Christopher (English)
Miguel (Portuguese): Michalis (Greek), Michael (English), Mick (Irish!)
[Figure: dendrogram with leaves Piotr, Pyotr, Peka, Petros, Pietro, Pierre, Piero, Peter, Peadar, Pedro, Peder, Mick, Michalis, Michael, Miguel, Krystof, Cristovao, Christoph, Crisdean, Cristobal, Christophe, Cristoforo, Kristoffer, Christopher.]

Pedro (Portuguese/Spanish): Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian Alternative), Petr (Czech), Pyotr (Russian)
[Figure: dendrogram with leaves Piotr, Pyotr, Pedro, Peter, Peka, Petros, Pietro, Pierre, Piero, Peder, Peadar.]

Hierarchical clustering can sometimes show patterns that are meaningless or spurious. For example, in this clustering, the tight grouping of Australia, Anguilla, St. Helena, etc. is meaningful, since all these countries are former UK colonies. However, the tight grouping of Niger and India is completely spurious; there is no connection between the two. [Figure: dendrogram with leaves ANGUILLA, AUSTRALIA, St. Helena & Dependencies, South Georgia & South Sandwich Islands, U.K., Serbia & Montenegro (Yugoslavia), FRANCE, NIGER, INDIA, IRELAND, BRAZIL.]

The flag of Niger is orange over white over green, with an orange disc on the central white stripe, symbolizing the sun. The orange stands for the Sahara desert, which borders Niger to the north. Green stands for the grassy plains of the south and west and for the River Niger which sustains them. It also stands for fraternity and hope. White generally symbolizes purity and hope.
The Indian flag is a horizontal tricolor in equal proportion of deep saffron on the top, white in the middle and dark green at the bottom. In the center of the white band, there is a wheel in navy blue to indicate the Dharma Chakra, the wheel of law in the Sarnath Lion Capital. This center symbol, or the 'CHAKRA', is a symbol dating back to the 2nd century BC. The saffron stands for courage and sacrifice; the white, for purity and truth; the green, for growth and auspiciousness.

We can look at the dendrogram to determine the “correct” number of clusters. In this case, the two highly separated subtrees are highly suggestive of two clusters. (Things are rarely this clear-cut, unfortunately.)
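One way to turn "highly separated subtrees" into a number is to look for the largest jump between successive merge heights and cut the tree there. The Python sketch below shows that heuristic under stated assumptions: the largest-gap rule is a common rule of thumb and is not a method given in these slides, and the function name and example heights are hypothetical.

```python
def suggest_num_clusters(merge_heights):
    """Suggest a cluster count by cutting the dendrogram inside the
    largest gap between consecutive merge heights.

    merge_heights: the n - 1 heights at which clusters were merged
    when building a dendrogram over n objects.
    """
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merges to compare gaps")
    n = len(heights) + 1  # number of objects (leaves)
    gaps = [heights[i + 1] - heights[i] for i in range(len(heights) - 1)]
    i = max(range(len(gaps)), key=gaps.__getitem__)
    # Cutting after merge number i + 1 leaves n - (i + 1) clusters.
    return n - (i + 1)


# Hypothetical heights: three low merges, then one very high merge.
print(suggest_num_clusters([1, 2, 3, 7]))  # 2
```

When the gaps are all of similar size the heuristic has little to say, which matches the caveat that things are rarely this clear-cut.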

One potential use of a dendrogram is to detect outliers. The single isolated branch is suggestive of a data point that is very different to all others. [Figure: dendrogram with one isolated branch labeled “Outlier”.]

(How-to) Hierarchical Clustering
The number of dendrograms with n leafs = (2n-3)! / [2^(n-2) (n-2)!]
Number of Leafs    Number of Possible Dendrograms
2                  1
3                  3
4                  15
5                  105
...                …
10                 34,459,425
Since we cannot test all possible trees, we will have to use a heuristic search over the space of possible trees. We could do this:
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
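The counting formula is easy to check numerically. A minimal Python sketch, assuming the formula is (2n-3)! / [2^(n-2) (n-2)!] for n >= 2 leaves; the function name is illustrative:

```python
from math import factorial


def num_dendrograms(n: int) -> int:
    """Number of dendrograms with n leaves: (2n-3)! / (2^(n-2) * (n-2)!)."""
    if n < 2:
        raise ValueError("the formula is defined for n >= 2")
    # The division is exact, so integer division keeps the result an int.
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))
# 2 1
# 3 3
# 4 15
# 5 105
# 10 34459425
```

The count grows so quickly that exhaustive search is out of reach even for small n, which is why a heuristic search is needed.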

We begin with a distance matrix which contains the distances between every pair of objects in our database. [Figure: an example distance matrix (entries shown include 8, 7, 2, 4, 3, 1), with two example pairs at distances D = 8 and D = 1.]

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together. At each step: consider all possible merges, then choose the best. [Figure sequence: the consider-then-choose step is shown repeatedly as the tree is built.]
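The bottom-up loop can be sketched in a few lines of Python. This is an illustration under assumptions, not the slides' implementation: "best pair" is taken to mean the pair of clusters with the smallest distance between their closest members, the distance matrix is a made-up example, and the names single_link and agglomerate are hypothetical.

```python
from itertools import combinations


def single_link(dist, a, b):
    """Distance between clusters a and b: their closest pair of members."""
    return min(dist[i][j] for i in a for j in b)


def agglomerate(dist):
    """Merge clusters bottom-up until one remains.

    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Consider all possible merges and choose the best (closest) pair.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_link(dist, pair[0], pair[1]))
        merges.append((sorted(a), sorted(b), single_link(dist, a, b)))
        clusters = [c for c in clusters if c != a and c != b] + [a | b]
    return merges


# Hypothetical symmetric distance matrix over five objects (0..4).
D = [
    [0, 8, 8, 7, 7],
    [8, 0, 2, 4, 4],
    [8, 2, 0, 3, 3],
    [7, 4, 3, 0, 1],
    [7, 4, 3, 1, 0],
]

for step in agglomerate(D):
    print(step)
# ([3], [4], 1)
# ([1], [2], 2)
# ([3, 4], [1, 2], 3)
# ([0], [1, 2, 3, 4], 7)
```

The merge distances are the heights at which the corresponding internal nodes would be drawn in the dendrogram.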

We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or between two clusters, is not obvious.
Single linkage (nearest neighbor): the distance between two clusters is the distance between the two closest objects (nearest neighbors) in the different clusters.
Complete linkage (furthest neighbor): the distance between two clusters is the greatest distance between any two objects in the different clusters (i.e., the "furthest neighbors").
Group average linkage: the distance between two clusters is the average distance between all pairs of objects in the two different clusters.
Ward's linkage: in this method, we try to minimize the variance of the merged clusters.
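The first three linkage rules differ only in how they summarize the object-to-object distances between two clusters. A minimal Python sketch, assuming clusters are given as lists of indices into a symmetric distance matrix; the matrix values and function names are hypothetical. Ward's linkage is left out because it works on cluster variances rather than on a plain pairwise-distance summary.

```python
def cross_distances(dist, a, b):
    """All object-to-object distances between clusters a and b."""
    return [dist[i][j] for i in a for j in b]


def single_linkage(dist, a, b):
    return min(cross_distances(dist, a, b))   # nearest neighbors


def complete_linkage(dist, a, b):
    return max(cross_distances(dist, a, b))   # furthest neighbors


def average_linkage(dist, a, b):
    d = cross_distances(dist, a, b)
    return sum(d) / len(d)                    # mean over all cross pairs


# Hypothetical symmetric distance matrix over five objects (0..4).
D = [
    [0, 8, 8, 7, 7],
    [8, 0, 2, 4, 4],
    [8, 2, 0, 3, 3],
    [7, 4, 3, 0, 1],
    [7, 4, 3, 1, 0],
]

a, b = [1, 2], [3, 4]
print(single_linkage(D, a, b))    # 3
print(complete_linkage(D, a, b))  # 4
print(average_linkage(D, a, b))   # 3.5
```

Single linkage never exceeds average linkage, which never exceeds complete linkage, so the same data can yield different merge orders and different trees.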

[Figure: example dendrograms produced with single linkage, average linkage, and Ward's linkage.]

Summary of Hierarchical Clustering Methods
No need to specify the number of clusters in advance.
The hierarchical nature maps nicely onto human intuition for some domains.
They do not scale well: time complexity of at least O(n^2), where n is the total number of objects.
As with any heuristic search algorithm, local optima are a problem.
Interpretation of results is (very) subjective.

Trees are very useful in biology… Why are most subtrees just one or two colors, but the subtree containing the Lion has one of each color? Johnson WE, Eizirik E, Pecon-Slattery J, et al. (January 2006). "The late Miocene radiation of modern Felidae: a genetic assessment". Science.

Darwin himself noted that languages seem to fit well with trees… “Irish/Welsh Split: Must be before 300AD. Archaic Irish inscriptions date back to the 5th century AD – divergence must have occurred well before this time.” How do we know the dates? If we can get dates, even upper/lower bounds, for some events, we can interpolate to the rest of the tree. Gray, R.D. and Atkinson, Q. D., Language tree divergence times support the Anatolian theory of Indo-European origin.

Do Trees Make Sense for non-Biological Objects? The answer is “Yes”. There are increasing theoretical and empirical results to suggest that phylogenetic methods work for cultural artifacts.
[Figure: two trees side by side, one over languages (Hellenic, Armenian, Persian) and one over primates (Gibbon, Sumatran Orangutan, Orangutan, Gorilla, Human, Pygmy Chimp, Chimpanzee).]
“Armenian borrowed so many words from Iranian languages that it was at first considered a branch of the Indo-Iranian languages, and was not recognized as an independent branch of the Indo-European languages for many decades.”
Does horizontal transmission invalidate cultural phylogenies? Greenhill, Currie & Gray.
Branching, blending, and the evolution of cultural similarities and differences among human populations. Collard, Shennan, & Tehrani.
“..results show that trees constructed with Bayesian phylogenetic methods are robust to realistic levels of borrowing”

One trick to test the applicability of phylogenetic methods outside of biology is to test on datasets for which you know the right answer by other methods. Here the results are very good, but not perfect. “Canadian Football is historically derived from the ancestor of rugby, but today closely resembles the American versions of the game. In this branch of the tree geography has trumped deeper phylogenetic history.” Gray, RD, Greenhill, SJ, & Ross, RM (2007). The Pleasures and Perils of Darwinizing Culture (with phylogenies). Biological Theory, 2(4).

Why would we want to use trees for human artifacts? Because trees are powerful in biology.
They make predictions: Pacific Yew produces taxol, which treats some cancers, but it is expensive. Its nearest relative, the European Yew, was also found to produce taxol.
They tell us the order of events: which came first, classic geometric spider webs, or messy cobwebs?
They tell us about “homelands” (where did it come from?), “dates” (when did it happen?), rates of change, and ancestral states.

They tell us the order of events. Which came first, classic geometric orb webs, or messy cobwebs? An orb web is shaped like a circle with spokes. A cobweb is a tangled mass of fluffy silk that catches insects. Surprisingly, the classic geometric spider webs came first, then evolved into the messy cobwebs. We know this because the “messy cobweb” spiders are in a subtree embedded in a larger tree that consists mostly of “geometric web” spiders.

Most parsimonious cladogram for Turkmen textile designs. Data is coded as a 90-dimensional binary vector (1 = presence, 0 = absence). Example features: Lobed gul; Lobed gul: birds; Lobed gul: clovers.
Salor = 1 0 1 1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 1 0 1 1 0 1 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 1 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 0 1 0
However, this is arbitrary in two ways: why these 90 features, and does this carpet really have Lobed Gul clovers?
J. Tehrani, M. Collard / Journal of Anthropological Archaeology 21 (2002) 443–463

Data is coded as an 8-dimensional integer vector, for example 21225212.
I. Location of maximum blade width: 1. Proximal quarter; 2. Secondmost proximal quarter; 3. Secondmost distal quarter; 4. Distal quarter
II. Base shape: 1. Arc-shaped; 2. Normal curve; 3. Triangular; 4. Folsomoid
III. Basal indentation ratio: 1. No basal indentation; 2. 0·90–0·99 (shallow); 3. 0·80–0·89 (deep)
IV. Constriction ratio: 1. 1·00; 2. 0·90–0·99; 3. 0·80–0·89; 4. 0·70–0·79; 5. 0·60–0·69; 6. 0·50–0·59
V. Outer tang angle: 1. 93–115; 2. 88–92; 3. 81–87; 4. 66–88; 5. 51–65; 6. <50
VI. Tang-tip shape: 1. Pointed; 2. Round; 3. Blunt
VII. Fluting: 1. Absent; 2. Present
VIII. Length/width ratio: 1. 1·00–1·99; 2. 2·00–2·99; 3. -; 4. 4·00–4·99; 5. 5·00–5·99; 6. ≥6·00
Cladistics Is Useful for Reconstructing Archaeological Phylogenies: Palaeoindian Points from the Southeastern United States. O’Brien
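To see what a code such as 21225212 says about a projectile point, the eight digits can be looked up in the character table. The Python sketch below does this under stated assumptions: the state numbering for the constriction ratio and the top length/width state are reconstructed from a garbled listing, state 3 of the length/width ratio is missing in the source and is left out, and the names CHARACTERS and decode are hypothetical.

```python
# (character name, {state number: description}) in the order I..VIII.
CHARACTERS = [
    ("Location of maximum blade width",
     {1: "Proximal quarter", 2: "Secondmost proximal quarter",
      3: "Secondmost distal quarter", 4: "Distal quarter"}),
    ("Base shape",
     {1: "Arc-shaped", 2: "Normal curve", 3: "Triangular", 4: "Folsomoid"}),
    ("Basal indentation ratio",
     {1: "No basal indentation", 2: "0.90-0.99 (shallow)",
      3: "0.80-0.89 (deep)"}),
    ("Constriction ratio",  # state numbers 1-3 and 6 are reconstructed
     {1: "1.00", 2: "0.90-0.99", 3: "0.80-0.89",
      4: "0.70-0.79", 5: "0.60-0.69", 6: "0.50-0.59"}),
    ("Outer tang angle",
     {1: "93-115", 2: "88-92", 3: "81-87", 4: "66-88", 5: "51-65", 6: "<50"}),
    ("Tang-tip shape", {1: "Pointed", 2: "Round", 3: "Blunt"}),
    ("Fluting", {1: "Absent", 2: "Present"}),
    ("Length/width ratio",  # state 3 is missing in the source; 6 is assumed
     {1: "1.00-1.99", 2: "2.00-2.99", 4: "4.00-4.99",
      5: "5.00-5.99", 6: "6.00 or more"}),
]


def decode(code: str) -> dict:
    """Map an 8-digit character-state string to readable descriptions."""
    if len(code) != len(CHARACTERS):
        raise ValueError("expected one digit per character")
    return {name: states[int(digit)]
            for (name, states), digit in zip(CHARACTERS, code)}


for name, state in decode("21225212").items():
    print(f"{name}: {state}")
```

Each digit is a categorical label, so any distance or parsimony computation over these vectors should treat states as categories and not as magnitudes.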

An Inverted Use of Dendrograms. Suppose you have a new distance measure A, and you want to claim it is better than the old method B. You could run some classification experiments and report some numbers: A gets 95%, B gets 90%. But this is not very forceful, and it does not tell you when you win or lose.

[Figure: two dendrograms over the same twelve numbered items, one per distance measure (B and A), each with a “One Second” scale bar. Groups: Tettigonioidea, Grylloidea. Species labels: Conocephalus nemoralis, Aglaothorax ovata, Amblycorypha huasteca, Cyrtoxipha columbiana, Neocurtilla hexadactyla, Hispanogryllus nesion. Leaf orders shown: 3 4 2 1 8 10 11 5 12 9 6 7, and (in the panel labeled A) 11 12 7 8 9 10 1 2 3 4 5 6.]

Note that the algorithm has no access to color information, just texture. (Dictionnaire D'Histoire Naturelle by Charles Orbigny, 1849.)

The algorithm can handle very subtle differences. Ornaments from the Hand-Press period: a distinguishing feature of this period is the use of wood blocks to print ornaments in books. Specialist historians want to record the individual instances of ornament occurrence and to identify the individual blocks. This identification work is very useful for dating books and for authenticating the output of particular printing houses and authors. Indeed, numerous editions published in past centuries do not reveal their true origin on the title page. The blocks could be re-used to print several books, exchanged between printing houses, or duplicated in the case of damage. (Mathieu Delalandre)

Indexing and Mining Rock Art. Rock art is found on every continent except Antarctica. Australia may have 100 million examples. To date, computer science has had little impact on the analysis of rock art. A decade ago, Walt et al. summed up the state of petroglyph research by noting, “Complete-site and cross-site research thus remains impossible, incomplete, or impressionistic”.

If we assume that we have high-quality binary images of rock art, then we can do clustering, classification, indexing, and motif discovery. [Figure: example motifs: Atlatls, Anthropomorphs, Bighorn Sheep.]
One challenge is designing distance measures. For example, we would like to find two motifs similar even though one is solid and one is hollow (the two example images are not reproduced here).
MATLAB code that plots a dendrogram of the images in a folder, using a distance based on the Generalized Hough Transform (GHT)*:

    function UI_Dendrogram_GHT(SIZE)
    %plot the dendrogram for images in the folder chosen by user
    D = uigetdir('','Choose a Directory containing images');
    flist = dir(D);
    fnum = length(flist);
    i = 1;
    j = 0;
    while i <= fnum
        if ~flist(i).isdir && ~strcmp(flist(i).name,'Thumbs.db')
            j = j + 1;
            I{j} = imread(sprintf('%s/%s',D,flist(i).name));
            try %if I{j} is not binary
                image_bw = ~im2bw(I{j},graythresh(I{j}));
            catch
                image_bw = ~I{j};
            end
            %% thin and resize
            image_thin = bwmorph(image_bw,'thin',3);    %Thinning
            image_thin = bwmorph(image_thin,'remove');  %Removes interior pixels
            I_nor{j} = normalize_img(~image_thin,SIZE); %Normalization
        end
        i = i + 1;
    end
    Y = im_pdist(I_nor,j); %Calculate the distance
    Z = linkage(Y, 'average');
    figure;
    subplot('Position',[0.05,.2,0.9,.75]);
    %P = 0 shows every leaf; the default collapses trees with more than 30 leaves
    [H, T, P] = dendrogram(Z, 0);
    axis off;
    w = 0.9/(j+1);
    for i = 1:j
        %draw each image under its leaf, in dendrogram leaf order
        subplot('Position',[0.05+w*0.5+w*.05+w*(i-1),.01,w*0.9,.18]);
        imshow(I{P(i)});
    end

    function img_nor = normalize_img(img,SIZE)
    %normalize img to SIZE*SIZE
    [idy, idx] = find(img == 0);
    minx = min(idx); maxx = max(idx);
    miny = min(idy); maxy = max(idy);
    xm = (minx+maxx)/2;
    ym = (miny+maxy)/2;
    scale = max((maxx-minx)/2,(maxy-miny)/2);
    img_nor = ones(SIZE,SIZE);
    for i = 1:length(idy)
        y = round(SIZE/2+SIZE/2*(idy(i)-ym)/scale);
        x = round(SIZE/2+SIZE/2*(idx(i)-xm)/scale);
        if y == 0
            y = 1;
        end
        if x == 0
            x = 1;
        end
        img_nor(y,x) = 0;
    end

    function D = im_pdist(I,n)
    %%n is total number of the images
    D(round((n*(n-1))/2)) = 1; %preallocate the condensed distance vector
    k = 1;
    for i = 1:n
        for j = i+1:n
            %sum both directions so the distance is symmetric
            D(k) = dist_GHT(I{j},I{i}) + dist_GHT(I{i},I{j});
            k = k + 1;
        end
    end

    function dist = dist_GHT(img_train,img_query)
    w = size(img_train,2);
    h = size(img_train,1);
    %Find the center of the query image
    [idy, idx] = find(img_query == 0);
    pointsnum_query = length(idy);
    p = [idx, idy]; % x and y coordinates
    c = round(mean(p));
    %Compute the voting vector dv for each point
    cmat = repmat(c, size(p,1), 1);
    dv = cmat - p;
    %Find all black pixels in train image
    [idy, idx] = find(img_train == 0);
    pointsnum_train = length(idy);
    q = [idx, idy];
    vote_space = zeros(h,w);
    for n = 1 : size(q,1)
        pmat = repmat(q(n,:), size(dv,1), 1);
        %vote for the center
        vote_points = pmat + dv;
        id = find(vote_points(:,1) >= 1 & vote_points(:,1) <= w &...
                  vote_points(:,2) >= 1 & vote_points(:,2) <= h);
        if ~isempty(id)
            pid = vote_points(id,2) + (vote_points(id,1) - 1) * h;
            vote_space(pid) = vote_space(pid) + 1;
        end
    end
    max_value = max(vote_space(:));
    if pointsnum_train > pointsnum_query
        %larger "pointsnum_train" tends to generate larger "max_value" for the same query image
        dist = sqrt(pointsnum_train / pointsnum_query) / max_value;
    else
        dist = 1/max_value;
    end
    %larger "pointsnum_train" and "pointsnum_query" tend to generate larger max_value
    dist = sqrt(pointsnum_train * pointsnum_query) * dist;

*Zhu, Wang, Keogh, Lee (2009). Augmenting the Generalized Hough Transform to Enable the Mining of Petroglyphs. SIGKDD 2009.

Eamonn Keogh Computer Science & Engineering Department University of California – Riverside eamonn@cs.ucr.edu