Slide 1: Data Clustering with Application to Relational Data
Adam Anthony, Ph.D. Candidate, University of Maryland, Baltimore County
Adviser: Marie desJardins

Slide 2: Overview
- Clustering tutorial
- My work: relational data clustering
  - Relational data examples
  - Sources of information in relational data clustering
  - Fast approximate relational data clustering
  - Constraining solutions in relational data clustering
  - Relation selection
- Conclusion

Slide 3: What Is Data Clustering?
- Clustering = grouping objects into categories without outside input
- The quality of a clustering depends on an objective. Which clustering is better?
  - By rank
  - By suit
  - By color
  - Combinations

Slide 4: Clustering: An Intelligence Perspective
Why is clustering considered an intelligent activity?
- What are the categories? {Squirrel, Marlin, Salmon, Mouse, Tuna, Bat}
- How many faces?
But there's more to it... { aardvark, addax, alligator, alpaca, anteater, antelope, aoudad, ape, argali, armadillo, ass, baboon, badger, basilisk, bat, bear, beaver, bighorn, bison, boar, budgerigar, buffalo, bull, bunny, burro, camel, canary, capybara, cat, chameleon, chamois, cheetah, chimpanzee, chinchilla, chipmunk, civet, coati, colt, cony, cougar, cow, coyote, crocodile, crow, deer, dingo, doe, dog, donkey, dormouse, dromedary, duckbill, dugong, eland, elephant, elk, ermine, ewe, fawn, ferret, finch, fish, fox, frog, gazelle, gemsbok, gila_monster, giraffe, gnu, goat, gopher, gorilla, grizzly_bear, ground_hog, guanaco, guinea_pig, hamster, hare, hartebeest, hedgehog, hippopotamus, hog, horse, hyena, ibex, iguana, impala, jackal, jaguar, jerboa, kangaroo, kid, kinkajou, kitten, koala, koodoo, lamb, lemur, leopard, lion, lizard, llama, lovebird, lynx, mandrill, mare, marmoset, marten, mink, mole, mongoose, monkey, moose, mountain_goat, mouse, mule, musk_deer, musk_ox, muskrat, mustang, mynah_bird, newt, ocelot, okapi, opossum, orangutan, oryx, otter, ox, panda, panther, parakeet, parrot, peccary, pig, platypus, polar_bear, pony, porcupine, porpoise, prairie_dog, pronghorn, puma, puppy, quagga, rabbit }

Slide 5: Clustering: An Agent's Perspective
- An agent has three short- and long-range binary sensors:
  - Light (high/low)
  - Heat (high/low)
  - Damaged (yes/no)
- Clustering can be used to predict unknown values:
  - Repair station (with lightbulb)
  - Candle (causes damage)
- How can clustering help this agent?
  - The agent can predict and avoid damage using clustering
  - Clustering can also filter out irrelevant information: add a noise sensor, but noise never causes damage

Slide 6: Formal Data Clustering
Data clustering is: dividing a set of data objects into groups such that there is a clear pattern (e.g., similarity to each other) for why objects are in the same cluster.
A clustering algorithm requires:
- A data set D
- A clustering description C
- A clustering objective Obj(C)
- An optimization method Opt(D) ~ C
Obj measures the goodness of the best clustering C that Opt(D) can find.

Slide 7: K-Means Clustering
- D = numeric d-dimensional data
- C = partitioning of the data points into k clusters
- Obj(C) = root mean squared error (RMSE): the average distance between each object and its cluster's mean value
- Optimization method:
  1. Select k random objects as the initial means
  2. While the current clustering is different from the previous:
     a. Move each object to the cluster with the closest mean
     b. Re-compute the cluster means
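The optimization loop above can be sketched in Python. This is a minimal illustration assuming NumPy; the function and variable names are my own, and empty clusters are deliberately not handled:

```python
import numpy as np

def kmeans(X, k, rng=None):
    """Lloyd's-style loop from the slide. X: (n, d) numeric array.
    Minimal sketch: empty clusters are not handled."""
    rng = rng or np.random.default_rng(0)
    # 1. Select k random objects as the initial means.
    means = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    # 2. Repeat while the clustering keeps changing.
    while True:
        # 2a. Move each object to the cluster with the closest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # same clustering as last pass: converged
        labels = new_labels
        # 2b. Re-compute the cluster means.
        means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, means
```

The convergence test is exactly the slide's "current clustering differs from the previous" check, which is why two runs with different initial means can take very different numbers of passes.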

Slide 8: K-Means Demo

Slide 9: K-Means Comments
- K-means is randomly initialized, which means:
  - Two executions on the same data with the same number of clusters will likely produce different results
  - Two executions may have very different run times, due to the convergence test
- In practice: run k-means multiple times and keep the result with the best RMSE

Slide 10: ___-Link Clustering
1. Initialize each object in its own cluster
2. Compute the cluster distance matrix M by the selected criterion (below)
3. While there are more than k clusters:
   a. Join the two clusters with the shortest distance
   b. Update M by the selected criterion
Criteria for ___-link clustering:
- Single-link: use the distance between the closest objects of the two clusters
- Complete-link: use the distance between the most distant objects of the two clusters
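A naive sketch of this agglomerative procedure, assuming NumPy and numeric data; the names are illustrative, and the exhaustive pairwise search is written for clarity rather than efficiency:

```python
import numpy as np

def linkage_cluster(X, k, criterion="single"):
    """Naive agglomerative ___-link clustering (clarity over speed).
    criterion: 'single' (closest pair) or 'complete' (farthest pair)."""
    # 1. Each object starts in its own cluster.
    clusters = [[i] for i in range(len(X))]
    # Pairwise object distances, computed once.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    agg = np.min if criterion == "single" else np.max
    # 2-3. Repeatedly join the two closest clusters until k remain.
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster distance under the selected criterion.
                d = agg(D[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters
```

Swapping `np.min` for `np.max` is the entire difference between single-link and complete-link here, which mirrors the slide's point that only the criterion changes.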

Slide 11: ___-Link Demo
- How can we measure the distance between these clusters (single-link distance vs. complete-link distance)?
- What is best for:
  - Spherical data (above)?
  - Chain-like data?

Slide 12: ___-Link Comments
- The ___-link algorithms are not random in any way, which means you will get the same results whenever you use the same data and the same number of clusters
- Choosing between these algorithms and k-means (or any other clustering algorithm) requires substantial research and careful analysis

Slide 13: My Research: Relational Data Clustering

Slide 14: Relational Data Clustering is:
The task of organizing objects into logical groups, or clusters, taking into account the relational links between objects.

Slide 15: Relational Data
- Formally:
  - A set of object domains
  - Sets of instances from those domains
  - Sets of relational tuples, or links, between instances
- In practice:
  - "Relational data" refers only to data that necessitates the use of links
  - Information not encoded using a relation is referred to as an attribute
- Spaces:
  - Attribute space = ignore relations
  - Relation space = ignore attributes
- Example:
  People:
    Name   Gender
    Sally  F
    Fred   M
    Joe    M
  Friends: a relation over People (the slide shows Sally, Fred, and Joe in a small friendship graph)
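The two spaces can be illustrated with the slide's three people. The friendship links below are a hypothetical completion, since the slide's graph is only partially legible in this transcript:

```python
import numpy as np

# The three people from the slide.
names = ["Sally", "Fred", "Joe"]

# Attribute space: ignore relations; describe each object only by its
# attributes (gender encoded here as F=0, M=1).
attribute_space = np.array([[0], [1], [1]])

# Relation space: ignore attributes; describe each object only by its
# links. Row i of the adjacency matrix says who person i is friends
# with. The link pattern is an assumption, not read from the slide.
relation_space = np.array([
    [0, 1, 1],   # Sally -- Fred, Sally -- Joe (assumed)
    [1, 0, 0],
    [1, 0, 0],
])

# In attribute space Fred and Joe are indistinguishable; in this
# relation space they are also identical to each other, while Sally's
# row is what sets her apart.
```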

Slide 16: Block Models
- A block model is a partitioning of the links in a relation
  - Reorder the rows and columns of an adjacency matrix by cluster label, then place boundaries between clusters
- Block b_ij: the set of edges from cluster i to cluster j (also referred to as the block position of a single link)
- If some blocks are dense and the rest are sparse, we can generate a summary graph
- Block modeling is useful for both visualization and numerical analysis
[Diagram: a reordered adjacency matrix with cluster boundaries, and the resulting summary graph annotated with block densities]
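The block densities behind such a summary graph can be sketched as follows, assuming a 0/1 adjacency matrix and integer cluster labels (the helper name is my own). Self-pairs are counted as possible link slots here, which is one of several reasonable conventions:

```python
import numpy as np

def block_densities(A, labels, k):
    """Density of each block b_ij: the fraction of possible links
    present from cluster i (rows) to cluster j (columns).
    Self-pairs count as possible link slots (a convention choice)."""
    dens = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            # Sub-matrix of A restricted to cluster i rows, cluster j
            # columns; its mean is the block's link density.
            dens[i, j] = A[labels == i][:, labels == j].mean()
    return dens
```

Dense diagonal blocks with sparse off-diagonal blocks would then correspond to the community-structure case discussed on the next slide.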

Slide 17: Two Relational Clustering Algorithms
Community detection:
- Maximizes connectivity within clusters and minimizes connectivity between clusters
- Intuitive concept that links identify classes
- Equivalent to maximizing density only in the diagonal blocks
- Faster than more general relational clustering approaches
Stochastic block modeling:
- Maximizes the likelihood that two objects in the same cluster have the same linkage pattern
  - Linkage may be within or between clusters
- Subsumes community detection: equivalent to maximizing density in any block, rather than just the diagonal
- Generalizes relational clustering

Slide 18: My Work: Block Modularity
- A general block-model-based clustering approach; models relations only
- Motivated by the poor scalability of stochastic block modeling
  - It would be useful to have a block modeling approach that scales as well as community detection algorithms
- Contributions:
  - A clearly defined measure of general relational structure (block modularity)
  - An iterative clustering algorithm that is much faster than prior work

Slide 19: Relational Structure
- What is "structure"?
  - High level: non-randomness
  - Relational structure: a non-random connectivity pattern
- A relation is structured if its observed connectivity pattern is clearly distinguished from that of a random relation

Slide 20: Approach Overview
- Assume that there exists a "model" random relation: any clustering of this relation will have a similar block model
- In contrast, for any non-random relation, there should exist at least one clustering that distinguishes that relation from the random block model: a structure-identifying clustering
- Structure-based clustering requires:
  1. A means of comparing relational structures
  2. A definition of a "model" random relation
  3. A method for finding the most structure-identifying clustering

Slide 21: Comparing Structure: Block Modularity
Given an input relation, a model random relation*, and a structure-identifying clustering, we compute block modularity:
1. Find the block model for each relation
2. Compute the absolute difference of the number of links in each block
3. Compute the sum of all the cells in the difference matrix (158 in the slide's worked example)
4. (Optional) Normalize the value by twice the number of links (0.4389 in the example)
*Required: the model random relation must have the same number of links as the input relation.
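The four steps can be sketched in Python, assuming a 0/1 adjacency matrix. The slide leaves the model random relation abstract, requiring only that it have the same number of links as the input, so this sketch uses uniformly spread expected link counts as an illustrative stand-in for the random block model:

```python
import numpy as np

def block_modularity(A, labels, k):
    """Block modularity of a clustering of a relation.

    A: 0/1 adjacency matrix; labels: cluster label (0..k-1) per object.
    Per the slide: block-model both relations, take the absolute
    difference of per-block link counts, sum, and normalize by 2m.
    """
    n = len(A)
    m = A.sum()                      # total number of links
    total = 0.0
    for i in range(k):
        for j in range(k):
            block = A[labels == i][:, labels == j]
            observed = block.sum()   # links in block b_ij of the input
            # Illustrative random model: the same m links spread
            # uniformly over all n*n cells (an assumption; the talk
            # only fixes the total link count).
            expected = m * block.size / (n * n)
            total += abs(observed - expected)
    return total / (2 * m)
```

On a graph with clear community structure, a clustering that matches the communities should score higher than one that mixes them, since its blocks deviate most from the uniform model.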

Slide 22: Finding a Structure-Identifying Clustering (or, Clustering with Block Modularity)
Referred to as BMOD for brevity.

Slide 23: Experimental Evaluation
Three algorithms:
- Block modularity optimization (BMOD)
- Stochastic block modeling (SBM), optimized using simulated annealing
- Newman's (2007) configuration mixture model (CMM), optimized with the Expectation-Maximization algorithm
Data sets:
- Artificial data
- Internet Movie Database (actor collaboration network)
- Protein interaction network
- Citeseer citation network
Metrics:
- SBM's negative log-likelihood
- Runtime
Hypothesis: BMOD will produce a block model of similar quality while being much faster than SBM and CMM.

Slide 24: Block Modularity Clustering Results

Slide 25: Why Is BMOD Fast?
- SBM vs. BMOD: number of iterations
- CMM vs. BMOD:
  - Number of iterations: CMM appears to perform poorly on sparse data sets
  - Cost of iterations: BMOD's iteration cost decreases as it nears convergence
[Table: number of iterations before convergence for SBM, CMM, and BMOD on the Artificial, IMDB, Protein, and Citeseer data sets]

Slide 26: Conclusion
- Fast and effective when compared to stochastic block modeling
- Iterative, and requires only some basic counting mechanisms
  - Much simpler and less error-prone than implementing a stochastic algorithm
  - Fewer mathematical prerequisites make the algorithm accessible to more programmers
- A measure of structure, not just an identifier: its value can be used for other applications
