Workshop1 Efficient Mining of Graph-Based Data Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook University of Texas at Arlington Department.


1 Efficient Mining of Graph-Based Data
CSE@UTA SRL Workshop
Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook
University of Texas at Arlington, Department of Computer Science and Engineering
http://cygnus.uta.edu/subdue

2 Motivation
Structural/relational data
Ease of graph representation

3 Graph-Based Discovery
[Figure: Input Database; Substructure S1 (graph form); Compressed Database]

4 Algorithm
1. Create substructure for each unique vertex label
Substructures: triangle (4), square (4), circle (1), rectangle (1)
[Figure: example graph with a circle, a rectangle, and four triangle/square pairs joined by "on" edges]

5 Algorithm
2. Expand best substructure by an edge or edge+neighboring vertex
[Figure: candidate substructures, each a pair of shapes joined by an "on" edge (triangle and square, rectangle and square, rectangle and triangle, circle and rectangle), shown beside the example graph]

6 Algorithm
3. Keep only best beam-width substructures on queue
4. Terminate when queue is empty or #discovered substructures >= limit
5. Compress graph and repeat to generate hierarchical description
Note: polynomially constrained
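Steps 1 to 4 can be sketched as a small beam search. This is a minimal illustration, not the Subdue implementation: the toy graph, the restriction of substructures to label chains along "on" edges, and the score (instances times size, standing in for compression) are all assumptions made for the example. Step 5 (compress and repeat) is left out.

```python
from collections import Counter

# Toy graph (an assumption for illustration): vertex id -> label.
LABELS = {
    "t1": "triangle", "t2": "triangle", "t3": "triangle", "t4": "triangle",
    "s1": "square", "s2": "square", "s3": "square", "s4": "square",
    "c1": "circle", "r1": "rectangle",
}
# Directed "on" edges (source is on destination); also illustrative.
EDGES = [("t1", "s1"), ("t2", "s2"), ("t3", "s3"), ("t4", "s4"),
         ("s1", "r1"), ("c1", "r1")]


def instances(pattern):
    """Return every vertex path whose labels spell out `pattern`."""
    paths = [[v] for v, label in LABELS.items() if label == pattern[0]]
    for wanted in pattern[1:]:
        paths = [p + [dst] for p in paths for src, dst in EDGES
                 if src == p[-1] and LABELS[dst] == wanted]
    return paths


def expand(pattern):
    """Step 2: grow a pattern by one outgoing edge plus its neighboring vertex."""
    next_labels = {LABELS[dst] for p in instances(pattern)
                   for src, dst in EDGES if src == p[-1]}
    return [pattern + (label,) for label in sorted(next_labels)]


def score(pattern):
    """Hypothetical stand-in for compression: instance count times pattern size."""
    return len(instances(pattern)) * len(pattern)


def discover(beam_width=3, limit=10):
    counts = Counter(LABELS.values())          # step 1: one candidate per label
    queue = [(label,) for label in counts]
    discovered = []
    while queue and len(discovered) < limit:   # step 4: termination test
        # Step 3: keep only the best beam_width candidates on the queue.
        queue = sorted(queue, key=score, reverse=True)[:beam_width]
        best = queue.pop(0)
        discovered.append(best)
        queue.extend(expand(best))             # step 2
    return max(discovered, key=score), counts


best, counts = discover()
print(counts)  # label counts from step 1
print(best)    # highest-scoring pattern found
```

On this toy graph the label counts match the slide (triangle 4, square 4, circle 1, rectangle 1), and the highest-scoring chain is triangle on square.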

7 Evaluation Metric
Substructures evaluated based on ability to compress input graph
Compression measured using minimum description length (DL)
Best substructure S in graph G minimizes: DL(S) + DL(G|S)
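The criterion DL(S) + DL(G|S) can be illustrated with a crude size measure. The sketch below assumes, purely for illustration, that the description length of a graph is its vertex count plus its edge count (a real MDL encoding counts bits), that instances do not overlap, and that each instance collapses to a single vertex. The graph and candidate numbers are hypothetical.

```python
def size(num_vertices, num_edges):
    """Assumed proxy for description length: vertices plus edges."""
    return num_vertices + num_edges


def total_dl(graph, sub, n_instances):
    """Approximate DL(S) + DL(G|S) for non-overlapping instances of S in G."""
    g_vertices, g_edges = graph
    s_vertices, s_edges = sub
    dl_s = size(s_vertices, s_edges)
    # Each instance is replaced by one vertex, so it removes (s_vertices - 1)
    # vertices and all s_edges internal edges from the graph.
    dl_g_given_s = size(g_vertices - n_instances * (s_vertices - 1),
                        g_edges - n_instances * s_edges)
    return dl_s + dl_g_given_s


# Hypothetical graph: 10 vertices, 6 edges.
GRAPH = (10, 6)
# Hypothetical candidates: name -> ((vertices, edges), instance count).
CANDIDATES = {
    "triangle on square": ((2, 1), 4),
    "circle on rectangle": ((2, 1), 1),
}

costs = {name: total_dl(GRAPH, sub, n) for name, (sub, n) in CANDIDATES.items()}
best = min(costs, key=costs.get)
print(costs, best)
```

Under these assumptions the uncompressed graph has size 16, the four-instance candidate brings the total down to 11, and the single-instance candidate raises it to 17, so the frequent substructure is preferred.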

8 Examples

9 Inexact Graph Match
Some variations may occur between instances
Want to abstract over minor differences
Difference = cost of transforming one graph into a graph isomorphic to the other
Match if cost/size < threshold
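The threshold test can be sketched for very small graphs by brute force. The cost model here is an assumption: each vertex relabel, edge deletion, or edge insertion costs 1, both graphs must have the same number of vertices, and "size" is taken to be vertices plus edges of the larger graph. The search tries every vertex mapping, so it is only workable for a handful of vertices.

```python
from itertools import permutations


def graph_size(g):
    """Assumed size measure: vertex count plus edge count."""
    return len(g["labels"]) + len(g["edges"])


def transform_cost(g1, g2):
    """Minimum unit-cost edits turning g1 into g2, over all vertex mappings."""
    n = len(g1["labels"])
    if n != len(g2["labels"]):
        raise ValueError("this sketch only handles equal vertex counts")
    target_edges = set(g2["edges"])
    best = None
    for perm in permutations(range(n)):
        # Vertices whose labels differ under this mapping must be relabeled.
        relabels = sum(g1["labels"][i] != g2["labels"][perm[i]] for i in range(n))
        # Edges present in only one graph must be deleted or inserted.
        mapped = {(perm[a], perm[b], label) for a, b, label in g1["edges"]}
        cost = relabels + len(mapped ^ target_edges)
        best = cost if best is None else min(best, cost)
    return best


def inexact_match(g1, g2, threshold):
    """Match if cost/size < threshold."""
    size = max(graph_size(g1), graph_size(g2))
    return transform_cost(g1, g2) / size < threshold


# Edges are (source index, destination index, label).
TRIANGLE_ON_SQUARE = {"labels": ["triangle", "square"], "edges": [(0, 1, "on")]}
CIRCLE_ON_SQUARE = {"labels": ["square", "circle"], "edges": [(1, 0, "on")]}

print(transform_cost(TRIANGLE_ON_SQUARE, CIRCLE_ON_SQUARE))
print(inexact_match(TRIANGLE_ON_SQUARE, CIRCLE_ON_SQUARE, 0.5))
```

The two example graphs differ by one vertex label, so the cost is 1 and the ratio is 1/3: they match at a threshold of 0.5 but not at 0.2.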

10 Parallel/Distributed Discovery
Divide graph into P partitions using Metis, distribute to P processors
Each processor performs serial Subdue on local partition
Broadcast best substructures, evaluate on other processors
Master processor stores best global substructures
Close to linear speedup
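The broadcast-and-evaluate pattern can be mimicked in a single process. This is only a schematic under loud assumptions: partitions are plain lists of pattern names rather than graph partitions produced by Metis, "serial Subdue" is replaced by picking the most frequent pattern, and no real processors or message passing are involved.

```python
from collections import Counter


def local_best(partition, k=1):
    """Stand-in for serial discovery: the k most frequent patterns locally."""
    return [pattern for pattern, _ in Counter(partition).most_common(k)]


def global_best(partitions, k=1):
    """Gather each partition's local best, then evaluate all of them everywhere."""
    candidates = set()
    for partition in partitions:
        candidates.update(local_best(partition, k))   # "broadcast" step
    # Every candidate is scored on every partition; the master sums the scores.
    totals = {c: sum(Counter(p)[c] for p in partitions) for c in candidates}
    return max(totals, key=totals.get), totals


# Hypothetical partitions holding occurrences of patterns A, B and C.
PARTITIONS = [["A", "A", "B"], ["B", "B", "A"], ["A", "C", "C"]]

best, totals = global_best(PARTITIONS)
print(best, totals)
```

Each partition nominates a different local winner here, but scoring the nominees across all partitions shows that A is the best globally, which is the reason for the evaluation round on the other processors.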

11 Graph-Based Concept Learning
One graph stores positive examples
One graph stores negative examples
Find substructure that compresses positive graph but not negative graph
(PosEgsNotCovered) + (NegEgsCovered)
Multiple iterations implement a set-covering approach
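The set-covering loop can be sketched with the count (PosEgsNotCovered) + (NegEgsCovered) used as an error to minimize. As a simplifying assumption, examples and candidate substructures are modeled as sets of feature names, and a candidate covers an example when it is a subset of it; real examples would be graphs matched by subgraph isomorphism. All names and data are hypothetical.

```python
def error(candidate, positives, negatives):
    """(PosEgsNotCovered) + (NegEgsCovered) for one candidate."""
    pos_not_covered = sum(not candidate <= example for example in positives)
    neg_covered = sum(candidate <= example for example in negatives)
    return pos_not_covered + neg_covered


def set_cover(candidates, positives, negatives):
    """Repeatedly pick the lowest-error candidate and drop the positives it covers."""
    remaining = list(positives)
    rules = []
    while remaining:
        best = min(candidates, key=lambda c: error(c, remaining, negatives))
        if not any(best <= example for example in remaining):
            break  # no candidate covers anything that is left
        rules.append(best)
        remaining = [example for example in remaining if not best <= example]
    return rules


# Hypothetical data: each example is a set of features.
POSITIVES = [frozenset({"a", "b"}), frozenset({"a", "c"}), frozenset({"d"})]
NEGATIVES = [frozenset({"b", "c"})]
CANDIDATES = [frozenset({"a"}), frozenset({"b"}), frozenset({"d"})]

rules = set_cover(CANDIDATES, POSITIVES, NEGATIVES)
print([sorted(rule) for rule in rules])
```

The first iteration picks {a}, which covers two positives and no negatives; the second picks {d} for the remaining positive, giving a two-rule description of the positive class.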

12 Concept-Learning Example
[Figure: example graphs built from object, triangle and square vertices with "on" and "shape" edges]

13 Concept-Learning Results
Chess endgames (19,257 examples)
Black King is (+) or is not (-) in check
99.8% FOIL, 99.21% Subdue

14 More Concept-Learning Results
Tic-Tac-Toe endgames
  + is win for X (958 examples)
  100% Subdue, 92.35% FOIL
Bach chorales
  Musical sequences (20 sequences)
  100% Subdue, 85.71% FOIL

15 Graph-Based Clustering
Iterate Subdue until single vertex
Each cluster (substructure) inserted into a classification lattice
[Figure: classification lattice descending from Root]

16 Clustering Example: Animals

Name       Body Cover      Heart Chamber   Body Temp.   Fertilization
mammal     hair            four            regulated    internal
bird       feathers        four            regulated    internal
reptile    cornified-skin  imperfect-four  unregulated  internal
amphibian  moist-skin      three           unregulated  external
fish       scales          two             unregulated  external

[Figure: graph for one example, an animal vertex with edges Name: mammal, BodyCover: hair, HeartChamber: four, BodyTemp: regulated, Fertilization: internal]

17 Graph-Based Clustering Results
Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: fish, BodyCover: scales, HeartChamber: two
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three

18 Cobweb Results
Comparison of Subdue and Cobweb results
Subdue lattice produced better generalization, resulting in fewer clusters at higher levels
Subdue lattice identifies overlap between (reptile) and (amphibian/fish)
[Figure: Cobweb hierarchy rooted at animals, with clusters amphibian/fish, mammal/bird and reptile, and leaves mammal, bird, fish, amphibian]

19 Clustering Example: DNA

20 Graph-Based Clustering Results
DNA
[Figure: discovered DNA substructures drawn as chemical fragments, labeled with coverage 61%, 68% and 71%]

21 Evaluation of Clusterings
Traditional evaluation:
  Not applicable to hierarchical domains
  Does not make sense to compare clusters in different subtrees
  Not applicable to relational clusterings

22 Properties of Good Clusterings
Small number of clusters
  Large coverage → good generality
Big cluster descriptions
  More features → more inferential power
Minimal or no overlap between clusters
  More distinct clusters → better defined concepts

23 New Evaluation Heuristic for Hierarchical Clusterings
Clustering rooted at C with c children H_i having |H_i| instances H_i,k
distance() measured by inexact graph match
Animals: Subdue CQ = 2.6, Cobweb CQ = 1.7

24 Graph-Based Data Mining: Application Domains
Biochemical domains
  Protein data
  DNA data
  Toxicology (cancer) data
Spatial-temporal domains
  Earthquake data
  Aircraft Safety and Reporting System
Telecommunications data
Program source code
Web topology
[Figure: web graph fragment with web_page vertices, hyperlink edges and a home page]

25 Theoretical Analysis
Galois lattice [Lequiere et al.]
Conceptual graphs [Sowa et al.]
PAC analysis [Jappy et al.]

26 Graph-based Data Mining
Pattern (substructure) discovery
Hierarchical discovery
Distributed discovery
Concept learning
Clustering
Compression heuristic based on minimum description length

27 Future Work
Concept learning
  Theoretical analysis
  Comparison to ILP systems
Clustering
  Classification lattice
  Hierarchical relational conceptual clustering evaluation metric
Probabilistic substructures
Domains: WWW, source code

28 Subdue Source Code and Data
http://cygnus.uta.edu/subdue

