GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING by Istvan Jonyer, Lawrence B. Holder and Diane J. Cook The University of Texas at Arlington.


Outline
What is hierarchical conceptual clustering?
Overview of Subdue
Conceptual clustering in Subdue
Evaluation of hierarchical clusterings
Experiments and results
Conclusions

What is clustering?

What is hierarchical conceptual clustering?
Unsupervised concept learning
Generating hierarchies to explain data
Applications
– Hypothesis generation and testing
– Prediction based on groups
– Finding taxonomies

Example hierarchical conceptual clustering
Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: fish, BodyCover: scales, HeartChamber: two
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three

The Problem
Hierarchical conceptual clustering in discrete-valued structural databases
Existing systems:
– Continuous-valued
– Discrete but unstructured
– We can do better! (The field is under-explored)

Related Work
Cobweb
Labyrinth
AutoClass
Snob
In Euclidean space: Chameleon, Cure
Unsupervised learning algorithms

The Solution Take Subdue and extend it!

Overview of Subdue
Data mining in graph representations of structural databases
[Figure: example input graph with vertices labeled A, B, C, D (each appearing twice), E, and F, connected by edges labeled a through g]

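Overview of Subdue
Iteratively searching for the best substructure using the minimum description length (MDL) heuristic
[Figure: the best substructure found, with vertices A, B, C, D and edges a, b, c]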

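Overview of Subdue
Compress using the best substructure
[Figure: the compressed graph, in which each instance of the substructure is replaced by a single vertex S, leaving vertices S, S, E, F and edges d, e, f, g]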
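The search-and-compress step can be made concrete with a small calculation. The sketch below is a simplification and not Subdue's actual measure: the MDL heuristic counts the bits needed to encode a graph, whereas this sketch substitutes a plain vertex-plus-edge count as a stand-in for description length. The function names are hypothetical, and the counts (an input graph of 10 vertices and 10 edges, a substructure of 4 vertices and 3 edges, 2 instances) are assumptions read off the figure labels that survive in the transcript.

```python
def size(num_vertices: int, num_edges: int) -> int:
    """Stand-in for description length: vertices plus edges (an assumption)."""
    return num_vertices + num_edges


def compressed_counts(g_vertices, g_edges, s_vertices, s_edges, instances):
    """Vertex and edge counts after replacing each non-overlapping instance
    of the substructure with a single placeholder vertex."""
    vertices = g_vertices - instances * (s_vertices - 1)
    edges = g_edges - instances * s_edges
    return vertices, edges


def compression_value(g_vertices, g_edges, s_vertices, s_edges, instances):
    """Original size divided by (substructure size + compressed graph size).
    Values above 1 mean the substructure compresses the graph."""
    cv, ce = compressed_counts(g_vertices, g_edges, s_vertices, s_edges, instances)
    return size(g_vertices, g_edges) / (size(s_vertices, s_edges) + size(cv, ce))


# Counts assumed from the slide figures, not taken from Subdue output.
remaining = compressed_counts(10, 10, 4, 3, 2)
value = compression_value(10, 10, 4, 3, 2)
print(remaining, round(value, 3))
```

Under these assumed counts the compressed graph keeps 4 vertices and 4 edges, which agrees with the S, S, F, E vertices and f, d, e, g edges left in the compressed figure.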

Overview of Subdue
Fuzzy match
– Inexact matching of subgraphs
– Applications: defining fuzzy concepts, evaluation of clusterings
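To illustrate the idea of inexact matching, here is a minimal sketch restricted to a very simple case: two instances that share the same attributes (star-shaped graphs with identical edge labels), so that the cost of transforming one into the other reduces to counting relabeled values. General inexact graph matching also has to handle added or deleted vertices and edges, which this sketch ignores. The function names and the threshold parameter are hypothetical; the two instances are rows from the animal table used later in the presentation.

```python
def substitution_cost(instance_a: dict, instance_b: dict) -> int:
    """Number of attribute values that must be relabeled to turn a into b.
    Assumes both instances have exactly the same attributes."""
    if instance_a.keys() != instance_b.keys():
        raise ValueError("sketch only handles instances with identical attributes")
    return sum(1 for key in instance_a if instance_a[key] != instance_b[key])


def fuzzy_match(instance_a: dict, instance_b: dict, threshold: float) -> bool:
    """True when the relabeling cost, as a fraction of the number of
    attributes, does not exceed the (hypothetical) threshold."""
    if not instance_a:
        return True  # two empty instances match trivially
    cost = substitution_cost(instance_a, instance_b)
    return cost / len(instance_a) <= threshold


mammal = {"Name": "mammal", "BodyCover": "hair", "HeartChamber": "four",
          "BodyTemp": "regulated", "Fertilization": "internal"}
bird = {"Name": "bird", "BodyCover": "feathers", "HeartChamber": "four",
        "BodyTemp": "regulated", "Fertilization": "internal"}

print(substitution_cost(mammal, bird))
print(fuzzy_match(mammal, bird, threshold=0.5))
```

A looser threshold lets the two instances count as occurrences of the same fuzzy concept, while a stricter one keeps them apart.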

Conceptual Clustering with Subdue
Use Subdue to identify clusters
– The best subgraph in an iteration defines a cluster
When to stop within an iteration?
1) Use the -limit option
2) Use the -size option
3) Use the first minimum heuristic (new)

The First Minimum Heuristic
Use the subgraph at the first local minimum
– Detect it using the -prune2 option

The First Minimum Heuristic
Not a greedy heuristic!
– Although the first local minimum is usually the global minimum
– The first local minimum is caused by a smaller, more frequently occurring subgraph
– Subsequent minima are caused by bigger, less frequently occurring subgraphs
=> The first subgraph is more general
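The heuristic above can be sketched as a scan over a sequence of values. This is an illustration only: it assumes the search produces a list of description-length values, one per candidate subgraph in the order they are considered, and the function name and the sample numbers are hypothetical rather than taken from Subdue.

```python
def first_local_minimum(values):
    """Index of the first local minimum: walk forward while the next value
    is not higher, and stop as soon as the values start to rise."""
    if not values:
        raise ValueError("values must not be empty")
    i = 0
    while i + 1 < len(values) and values[i + 1] <= values[i]:
        i += 1
    return i


# Hypothetical description lengths with two minima.
lengths = [10, 8, 6, 7, 9, 5, 8]
first = first_local_minimum(lengths)
overall = min(range(len(lengths)), key=lengths.__getitem__)
print(first, overall)
```

In this made-up sequence the first local minimum (index 2) differs from the global minimum (index 5), which is the situation where the heuristic deliberately prefers the earlier, more general subgraph.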

The First Minimum Heuristic A multi-minimum search space:

Lattice vs. Tree
Previous work defined classification trees
– Inadequate in structured domains
Better hierarchical description: classification lattice
– A cluster can have more than one parent
– A parent can be at any level (not only one level above)
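The difference between the two structures can be shown with a small data-structure sketch. The class and function names and the cluster labels S1 to S3 are hypothetical; the only property taken from the slide is that a lattice node may have several parents, possibly at different levels.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison avoids recursing through parent/child cycles
class Cluster:
    name: str
    parents: list = field(default_factory=list, repr=False)
    children: list = field(default_factory=list, repr=False)


def link(parent: Cluster, child: Cluster) -> None:
    """Record a parent-child relation in both directions."""
    parent.children.append(child)
    child.parents.append(parent)


def is_tree(clusters) -> bool:
    """A hierarchy is a tree only if no cluster has more than one parent."""
    return all(len(c.parents) <= 1 for c in clusters)


root, s1, s2, s3 = Cluster("Root"), Cluster("S1"), Cluster("S2"), Cluster("S3")
link(root, s1)
link(s1, s2)
link(s1, s3)  # parent two levels above S3
link(s2, s3)  # second parent, one level above S3

print(is_tree([root, s1, s2, s3]))
print([p.name for p in s3.parents])
```

Here S3 has two parents at different levels, so the hierarchy is a lattice and cannot be drawn as a classification tree.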

Hierarchical Clustering in Subdue
Subdue can compress by a subgraph after each iteration
Subsequent clusters may be defined in terms of previously defined clusters
This results in a hierarchy
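One way to picture how the hierarchy arises is to look at which earlier clusters each new cluster definition mentions. The sketch below assumes that every iteration yields a cluster definition as a list of labels, where a label may be the name of a previously discovered cluster (the placeholder vertex left by compression). The function name, the "Root" label for the top node, and the sample definitions are hypothetical, and the discovery step itself is not modeled.

```python
def parent_links(definitions: dict) -> dict:
    """Map each cluster to the previously defined clusters its definition uses.
    Clusters that use none are attached to a top node called "Root".
    Relies on dict insertion order matching the order of discovery."""
    links = {}
    for name, parts in definitions.items():
        # Only names already in `links` were defined in an earlier iteration.
        earlier = [p for p in parts if p in links]
        parents = list(dict.fromkeys(earlier))  # drop repeats, keep order
        links[name] = parents or ["Root"]
    return links


# Hypothetical definitions, listed in order of discovery.
definitions = {
    "S1": ["A", "B", "C", "D"],
    "S2": ["S1", "E"],
    "S3": ["S1", "S2", "F"],
}
print(parent_links(definitions))
```

A definition that mentions more than one earlier cluster, such as S3 here, ends up with more than one parent.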

Hierarchical Conceptual Clustering of an Artificial Domain

[Figure: classification lattice for the artificial domain, starting from the Root node]

Evaluation of Clusterings
Traditional evaluation:
– Not applicable to hierarchical domains
No known evaluation for hierarchical clusterings
– Most hierarchical evaluations are anecdotal

New Evaluation Heuristic for Hierarchical Clusterings
Properties of a good clustering:
– Small number of clusters: large coverage => good generality
– Big cluster descriptions: more features => more inferential power
– Minimal or no overlap between clusters: more distinct clusters => better-defined concepts

New Evaluation Heuristic for Hierarchical Clusterings
Big clusters: bigger distance between disjoint clusters
Overlap: less overlap => bigger distance
Few clusters: averaging comparisons

Experiments and Results
Validation in an artificial domain
Validation in unstructured domains
Comparison to existing systems
Real-world applications

The Animal Domain
Name       Body Cover      Heart Chamber   Body Temp.   Fertilization
mammal     hair            four            regulated    internal
bird       feathers        four            regulated    internal
reptile    cornified-skin  imperfect-four  unregulated  internal
amphibian  moist-skin      three           unregulated  external
fish       scales          two             unregulated  external
[Figure: graph representation of the mammal instance, with a central "animal" vertex connected by edges Name, BodyCover, HeartChamber, BodyTemp, and Fertilization to the vertices mammal, hair, four, regulated, and internal]
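The graph representation of the table can be sketched as follows: each row becomes a star with an "animal" vertex at the center and one labeled edge per attribute, as in the mammal example on the slide. The vertex numbering scheme and the function names are assumptions made for illustration and do not reproduce Subdue's input file format.

```python
ATTRIBUTES = ["Name", "BodyCover", "HeartChamber", "BodyTemp", "Fertilization"]
ROWS = [
    ("mammal", "hair", "four", "regulated", "internal"),
    ("bird", "feathers", "four", "regulated", "internal"),
    ("reptile", "cornified-skin", "imperfect-four", "unregulated", "internal"),
    ("amphibian", "moist-skin", "three", "unregulated", "external"),
    ("fish", "scales", "two", "unregulated", "external"),
]


def to_star_graph(row, first_id):
    """Return (vertices, edges) for one animal.
    vertices maps id -> label; edges are (source_id, target_id, label)."""
    center = first_id
    vertices = {center: "animal"}
    edges = []
    for offset, (attribute, value) in enumerate(zip(ATTRIBUTES, row), start=1):
        vertices[center + offset] = value
        edges.append((center, center + offset, attribute))
    return vertices, edges


def table_to_graph(rows):
    """Combine the star graphs of all rows into one disconnected graph."""
    vertices, edges = {}, []
    next_id = 1
    for row in rows:
        v, e = to_star_graph(row, next_id)
        vertices.update(v)
        edges.extend(e)
        next_id += len(v)
    return vertices, edges


vertices, edges = table_to_graph(ROWS)
print(len(vertices), len(edges))
```

Each animal contributes 6 vertices and 5 edges, so the five rows give a graph of five disconnected stars.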

Hierarchical Clustering of the Animal Domain
Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: fish, BodyCover: scales, HeartChamber: two
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three
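The cluster descriptions in this hierarchy can be checked against the animal table with a few lines of code. The sketch below is not how Subdue discovers the clusters; it only computes the attribute values shared by a chosen group of animals, so the groupings passed in are assumptions taken from the hierarchy itself. The function name is hypothetical.

```python
ANIMALS = {
    "mammal": {"BodyCover": "hair", "HeartChamber": "four",
               "BodyTemp": "regulated", "Fertilization": "internal"},
    "bird": {"BodyCover": "feathers", "HeartChamber": "four",
             "BodyTemp": "regulated", "Fertilization": "internal"},
    "reptile": {"BodyCover": "cornified-skin", "HeartChamber": "imperfect-four",
                "BodyTemp": "unregulated", "Fertilization": "internal"},
    "amphibian": {"BodyCover": "moist-skin", "HeartChamber": "three",
                  "BodyTemp": "unregulated", "Fertilization": "external"},
    "fish": {"BodyCover": "scales", "HeartChamber": "two",
             "BodyTemp": "unregulated", "Fertilization": "external"},
}


def shared_description(names):
    """Attribute-value pairs common to every named animal."""
    first, *rest = (ANIMALS[name] for name in names)
    return {key: value for key, value in first.items()
            if all(other[key] == value for other in rest)}


print(shared_description(["mammal", "bird"]))
print(shared_description(["reptile", "amphibian", "fish"]))
print(shared_description(["amphibian", "fish"]))
```

The amphibian and fish group shares both BodyTemp: unregulated and Fertilization: external; because its parent cluster already states BodyTemp: unregulated, the child cluster only needs to add Fertilization: external.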

Hierarchical Clustering of the Animal Domain by Cobweb
animals
  amphibian/fish
    fish
    amphibian
  mammal/bird
    mammal
    bird
  reptile

Comparison of Subdue and Cobweb
Quality of Subdue’s lattice (tree): 2.60
Quality of Cobweb’s tree: 1.74
Subdue’s clustering therefore scores higher under the new evaluation heuristic
Reasons for the higher score:
– Better generalization, resulting in fewer clusters
– Eliminating the overlap between (reptile) and (amphibian/fish)

Chemical Application: Clustering of a DNA sequence

Coverage
– 61%
– 68%
– 71%
[Figure: clusters discovered in the DNA sequence, drawn as chemical substructures over C, N, O, and P atoms, including O=P-OH and CH2 fragments]

Conclusions
Goal of hierarchical conceptual clustering of structured databases was achieved
Synthesized classification lattice
Developed new evaluation heuristic for hierarchical clusterings
Good performance in comparison to other systems, even in unstructured domains

Future Work
More experiments on real-world domains
Comparison to other systems
Incorporation of evaluation tool into Subdue