
Graph-based Learning and Discovery Diane J. Cook University of Texas at Arlington


1 Graph-based Learning and Discovery Diane J. Cook University of Texas at Arlington cook@cse.uta.edu http://www-cse.uta.edu/~cook

2 Data Mining
“The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [Frawley et al., 92]
- Increasing ability to generate data
- Increasing ability to store data

3 KDD Process

4 Approaches to Data Mining
- Pattern extraction
- Prediction / classification
- Clustering
[Figure: loan example over Income and Debt attributes (Loan / No Loan; ranges <50, 50-100, >100)]

5 Substructure Discovery
- Most data mining algorithms deal with linear attribute-value data
- Need to represent and learn relationships between attributes

6 Subdue
- Discovers repetitive substructure patterns in graph databases
- Pattern extraction, classification, clustering
- Serial and parallel / distributed versions
- Applied to CAD circuits, telecom, DNA, and more
- http://cygnus.uta.edu/subdue

7 Graph Representation
- Input is a labeled graph
- A substructure is a connected subgraph
- An instance of a substructure is a subgraph that is isomorphic to the substructure definition
[Figure: Input Database (vertices R1, C1, T1 to T4, S1 to S4); Substructure S1 in graph form (labels: object, triangle, square, on, shape); Compressed Database (R1, C1)]

8 MDL Principle
- Best theory minimizes description length of data
- Evaluate a substructure based on its ability to compress the description length (DL) of the graph
- Description length = DL(S) + DL(G|S)
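A rough sketch of the sum DL(S) + DL(G|S) follows. It is not Subdue's bit-level encoding: as a stated simplification, description length is approximated by graph size (vertices plus edges), instances are assumed not to overlap, and each instance is assumed to collapse to a single vertex. All function names are hypothetical.

```python
def size(num_vertices, num_edges):
    """Crude stand-in for description length: vertices plus edges (assumption)."""
    return num_vertices + num_edges


def compressed_size(graph_v, graph_e, sub_v, sub_e, instances):
    """Approximate DL(S) + DL(G|S) for a substructure S with `instances` copies in G."""
    dl_s = size(sub_v, sub_e)
    # Each non-overlapping instance collapses to one vertex; its internal edges vanish.
    v_after = graph_v - instances * sub_v + instances
    e_after = graph_e - instances * sub_e
    dl_g_given_s = size(v_after, e_after)
    return dl_s + dl_g_given_s


def compression_ratio(graph_v, graph_e, sub_v, sub_e, instances):
    """Values above 1 mean the substructure shortens the description under this proxy."""
    original = size(graph_v, graph_e)
    return original / compressed_size(graph_v, graph_e, sub_v, sub_e, instances)
```

Under this proxy, a two-vertex, one-edge substructure with four instances in a graph of 10 vertices and 12 edges reduces the size from 22 to 17.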

9 Algorithm
1. Create substructure for each unique vertex label
[Figure: example graph with one circle, one rectangle, four triangles, and four squares; edges labeled "on" and "left"]
Substructures: triangle (4), square (4), circle (1), rectangle (1)

10 Algorithm
2. Expand best substructure by an edge or edge + neighboring vertex
[Figure: the same example graph, with candidate substructures drawn as graphs; label groups: triangle / square / on, circle / left / square, rectangle / square / on, rectangle / triangle / on]

11 Algorithm
3. Keep only best substructures on queue (specified by beam width)
4. Terminate when queue is empty or #discovered substructures >= limit
5. Compress graph and repeat to generate hierarchical description
Note: polynomially constrained [IEEE Exp96]
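Steps 1 to 4 can be sketched as a generic beam search. This is a minimal illustration, not the Subdue implementation: the `expand` and `score` callbacks are hypothetical placeholders for substructure expansion and MDL evaluation, and step 5 (compress and repeat) is left out.

```python
from collections import Counter


def initial_substructures(vertex_labels):
    """Step 1: one candidate per unique vertex label, with its instance count."""
    return Counter(vertex_labels.values())


def beam_search(initial, expand, score, beam_width, limit):
    """Steps 2-4: expand the best candidate, prune to the beam, stop at the limit."""
    queue = sorted(initial, key=score, reverse=True)[:beam_width]
    discovered = []
    # Step 4: terminate when the queue is empty or enough candidates were found.
    while queue and len(discovered) < limit:
        best = queue.pop(0)
        discovered.append(best)
        # Step 2: extend the best candidate (by an edge, in Subdue's case).
        queue.extend(expand(best))
        # Step 3: keep only the beam_width highest-scoring candidates.
        queue.sort(key=score, reverse=True)
        del queue[beam_width:]
    return sorted(discovered, key=score, reverse=True)
```

The beam width bounds how many candidates survive each round, which is what keeps the search tractable at the cost of possibly missing the best substructure.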

12 Examples [Jair94]

13 Inexact Graph Match [JIIS95]
- Some variations may occur between instances
- Want to abstract over minor differences
- Difference = cost of transforming one graph to make it isomorphic to another
- Match if cost/size < threshold

14 Inexact Graph Match
[Figure: two small labeled graphs (vertices 1, 2 and 3, 4, 5; vertex labels A, B; edge labels a, b) and the search tree of partial vertex mappings with their costs]
Least-cost match is {(1,4), (2,3)}
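The cost/size test can be illustrated with a brute-force sketch. The unit costs (1 per vertex or edge insertion, deletion, or vertex relabeling), the choice of the larger graph's size as the denominator, and all names are assumptions made for illustration; the exhaustive enumeration is only practical for tiny graphs and is not the search procedure from [JIIS95].

```python
from itertools import product


def mapping_cost(g1, g2, mapping):
    """Cost of transforming g1 into g2 under a vertex mapping (None = deleted)."""
    (v1, e1), (v2, e2) = g1, g2
    cost = 0
    for u, target in mapping.items():
        if target is None or v1[u] != v2[target]:
            cost += 1  # delete the vertex, or substitute its label
    used = {t for t in mapping.values() if t is not None}
    cost += len(set(v2) - used)  # vertices that must be inserted
    preserved = set()
    for a, b, label in e1:
        image = (mapping[a], mapping[b], label)
        if image in e2:
            preserved.add(image)
        else:
            cost += 1  # edge has no counterpart: delete it
    cost += len(e2 - preserved)  # edges that must be inserted
    return cost


def min_transform_cost(g1, g2):
    """Try every injective partial vertex mapping and keep the cheapest."""
    ids1 = list(g1[0])
    choices = list(g2[0]) + [None]
    best = None
    for targets in product(choices, repeat=len(ids1)):
        real = [t for t in targets if t is not None]
        if len(real) != len(set(real)):
            continue  # two vertices mapped to the same target
        mapping = dict(zip(ids1, targets))
        cost = mapping_cost(g1, g2, mapping)
        if best is None or cost < best[0]:
            best = (cost, mapping)
    return best


def inexact_match(g1, g2, threshold):
    """Match if cost / size < threshold, with size taken from the larger graph."""
    cost, _ = min_transform_cost(g1, g2)
    size = max(len(g1[0]) + len(g1[1]), len(g2[0]) + len(g2[1]))
    return cost / size < threshold
```

Each graph is a pair of a vertex dictionary (id to label) and a set of directed, labeled edges `(source, target, label)`.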

15 Background Knowledge [IEEE TKDE96]
- Some substructures not relevant
- Background knowledge can bias search
- Two types
  - Model knowledge
  - Graph match rules

16

17 Parallel/distributed Subdue [JPDC00]
- Scalability issues
- Three approaches
  - Dynamic partitioning
  - Functional parallel
  - Static partitioning

18 Dynamic Partitioning
- Processor i stores ith vertex label
- Each processor operates as in serial Subdue
- Avoid replication by expanding to higher vertices
[Figure: example graph with vertices v1 to v5 and edges e1 to e4]

19 Dynamic Partitioning
- Partitions are logical
- Excessive processor idling and load balancing
- Results very poor

20 Functional Parallel
- Master processor controls search queue
- Slaves evaluate and expand substructures
- Synchronization after each step

21 Functional Parallel Results
- ART database: 1,000 vertices and 2,000 edges
- CAD database: 8,441 vertices and 19,206 edges

22 Static Partitioning
- Divide graph into P partitions, distribute to P processors
- Each processor performs serial Subdue on local partition
- Broadcast best substructures, evaluate on other processors
- Master processor stores best global substructures
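The four steps above can be simulated sequentially in a short sketch. Several stand-ins are assumptions made for illustration: partitions are contiguous blocks of vertex ids rather than the output of a graph partitioner, "serial Subdue" is replaced by counting single labeled-edge patterns, and the processors are simulated by a loop. All names are hypothetical.

```python
from collections import Counter


def split_into_partitions(vertex_ids, p):
    """Divide vertex ids into at most p contiguous blocks (naive stand-in)."""
    ids = sorted(vertex_ids)
    chunk = max(1, -(-len(ids) // p))  # ceiling division
    return [set(ids[i:i + chunk]) for i in range(0, len(ids), chunk)]


def local_pattern_counts(vertices, edges, part):
    """Stand-in for serial discovery: count edge patterns fully inside one partition."""
    counts = Counter()
    for u, v, label in edges:
        if u in part and v in part:  # edges crossing partitions are lost
            counts[(vertices[u], label, vertices[v])] += 1
    return counts


def static_partition_discovery(vertices, edges, p):
    """Return the globally best (pattern, count) pair, or None if nothing is found."""
    parts = split_into_partitions(vertices, p)
    local = [local_pattern_counts(vertices, edges, part) for part in parts]
    # Each simulated processor nominates its best local pattern.
    candidates = {c.most_common(1)[0][0] for c in local if c}
    # "Broadcast": every nominee is re-evaluated on all partitions and summed.
    totals = {cand: sum(c[cand] for c in local) for cand in candidates}
    if not totals:
        return None
    return max(totals.items(), key=lambda kv: kv[1])
```

Because edges that cross a partition boundary are dropped in `local_pattern_counts`, the sketch also shows where information can be lost when the graph is split.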

23 Static Partitioning Results
- Close to linear speedup
- Continue until #processors > #vertices

24 Speedup Comparison

25 Issues
- Partitioning the graph loses information
- Metis graph partitioning system
- Quality of resulting substructures?
- Recapture by overlap, multiple partitions
- Evaluating more substructures globally

26 Compression Results

27 Recapture Lost Information
- Allow overlap between partitions
- Run twice with two partitions, take max results

28 Recapture Lost Information

29 AutoClass
- Linear representation
- Fit possible probabilistic models to data
- Satellite data, DNA data, Landsat data

30 SUBDUE/AutoClass Combined
[Figure: diagram with labels Data, structural features, Subdue, structural patterns, linear features, AutoClass, Classes; "+" = combination of linear data or addition of linear features]

31 Example - 30 2-color squares
- AutoClass Rep - tuple for each line (x1, y1, x2, y2, angle, length, color)
- Add structure (neighboring edge information)
- Subdue Rep - each line is node in graph, edges between connecting lines
- Attributes from nodes
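The two representations above might be built from the same line segments as follows. This is a sketch under assumptions: the input tuple layout `(x1, y1, x2, y2, color)`, the angle in degrees, and the rule that two lines "connect" when they share an endpoint are all illustrative choices, and the function names are hypothetical.

```python
import math
from itertools import combinations


def autoclass_tuple(line):
    """Linear representation: (x1, y1, x2, y2, angle, length, color)."""
    x1, y1, x2, y2, color = line
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
    length = math.hypot(x2 - x1, y2 - y1)
    return (x1, y1, x2, y2, angle, length, color)


def subdue_graph(lines):
    """Graph representation: one node per line, one edge per pair of touching lines."""
    # Node attributes reuse the linear tuple so both views carry the same features.
    vertices = {i: autoclass_tuple(line) for i, line in enumerate(lines, start=1)}
    edges = []
    for (i, a), (j, b) in combinations(enumerate(lines, start=1), 2):
        ends_a = {(a[0], a[1]), (a[2], a[3])}
        ends_b = {(b[0], b[1]), (b[2], b[3])}
        if ends_a & ends_b:  # the two lines share an endpoint
            edges.append((i, j))
    return vertices, edges
```

For a single square drawn as four segments, the graph has four nodes and four edges, one per corner.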

32 Results
- AutoClass (12 classes)
  Class 0 (20): Color=green, LineNo=Line1=Line2=98 +/- 10
  Class 1 (20): Color=red, LineNo=Line1=Line2=99 +/- 10
  …
  Class 11 (3): Line2=1 +/-13, Color=green
- Subdue (top substructure)

33 Combined Results
- Combine 4 entries for each square into one
- 30 tuples (one for each square)
- Discover
  Class 0 (10): Color1=red, Color2=red, Color3=green, Color4=green
  Class 1 (10): Color1=green, Color2=green, Color3=blue, Color4=blue
  Class 2 (10): Color1=blue, Color2=blue, Color3=red, Color4=red

34 More Results

35 Supervised SUBDUE [IEEE IS00]
- One graph stores positive examples
- One graph stores negative examples
- Find substructure that compresses positive graph but not negative graph
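The selection criterion can be illustrated with a toy score. This is not the evaluation measure from [IEEE IS00]: as an assumption, candidates are limited to single labeled edges written as `(source_label, edge_label, target_label)`, and "compression" is approximated by the instance count in each graph. The function names are hypothetical.

```python
from collections import Counter


def instances(pattern, graph_edges):
    """Number of times a single labeled-edge pattern occurs in a graph."""
    return Counter(graph_edges)[pattern]


def supervised_score(pattern, positive_edges, negative_edges):
    """Reward instances in the positive graph, penalize instances in the negative one."""
    return instances(pattern, positive_edges) - instances(pattern, negative_edges)


def best_pattern(positive_edges, negative_edges):
    """Pick the positive-graph pattern with the highest score (needs a non-empty graph)."""
    candidates = set(positive_edges)
    return max(candidates,
               key=lambda p: supervised_score(p, positive_edges, negative_edges))
```

A pattern that is frequent in both graphs scores near zero, so it is passed over in favor of one that is frequent only among the positive examples.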

36 Example
[Figure: example graph with labels object, on, triangle, square, shape]

37 Results
- Chess endgames (19,257 examples), BK is (+) or is not (-) in check
- 99.8% FOIL, 99.77% C4.5, 99.21% Subdue

38 More Results
- Tic Tac Toe endgames
  - + is win for X (958 examples)
  - 100% Subdue, 92.35% FOIL, 96.03% C4.5
- Bach chorales
  - Musical sequences (20 sequences)
  - 100% Subdue, 85.71% FOIL, 82.00% C4.5

39 Clustering Using SUBDUE
- Iterate Subdue until single vertex
- Each cluster (substructure) inserted into a classification lattice
- Early results similar to COBWEB [Fisher87]
[Figure: classification lattice, starting at Root]

40 Discovery Application Domains
- Biochemical domains
  - Protein data [PSB99, IDA99]
  - Human Genome DNA data
  - Toxicology (cancer) data
- Spatial-temporal domains
  - Earthquake data
  - Aircraft Safety and Reporting System
- Telecommunications data
- Program source code

41 Structured Web Search [AAAI-AIWS00]
- Existing search engines use linear feature match
- Subdue searches based on structure
- Incorporation of WordNet allows for inexact feature match through synset path length
- Technique
  - Breadth-first search through domain to generate graph
  - Nodes represent pages / documents
  - Edges represent hyperlinks
  - Additional nodes used to represent document keywords
  - Pose query as graph
  - Search for query match within domain graph
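The graph-generation step might look like the sketch below. To stay self-contained it walks a hypothetical in-memory `site` dictionary (URL to a pair of outgoing links and keywords) instead of fetching pages, and it emits the `v` / `d` line format that appears on the query slide later in this talk. The function name and the `site` structure are assumptions.

```python
from collections import deque


def build_domain_graph(site, start):
    """Breadth-first walk over `site`, returning vertex lines followed by edge lines."""
    vertex_id = {}
    vertex_lines, edge_lines = [], []

    def vid(key, label):
        # Assign sequential ids in the order vertices are first seen.
        if key not in vertex_id:
            vertex_id[key] = len(vertex_id) + 1
            vertex_lines.append(f"v {vertex_id[key]} {label}")
        return vertex_id[key]

    queue, seen = deque([start]), {start}
    while queue:
        url = queue.popleft()
        links, words = site.get(url, ([], []))
        src = vid(("page", url), "_page_")
        for word in words:
            # Keywords become extra nodes attached to their page.
            dst = vid(("word", url, word), word)
            edge_lines.append(f"d {src} {dst} _word_")
        for target in links:
            if target not in site:
                continue  # stay inside the domain
            dst = vid(("page", target), "_page_")
            edge_lines.append(f"d {src} {dst} _hyperlink_")
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return vertex_lines + edge_lines
```

Pages outside the `site` dictionary are skipped, which mirrors restricting the crawl to one domain.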

42 Sample Search
[Figure: query graph with labels Instructor, Teaching, Robotics, Research, Robotics, Publication, Robotics, http, Postscript | PDF]

43 Query: Find all pages which link to a page containing term ‘subdue’
Subgraph vertices:
1 _page_ URL: http://cygnus.uta.edu
7 _page_ URL: http://cygnus.uta.edu/projects.html
8 Subdue
[1->7] hyperlink
[7->8] word
[Figure: query graph, page -hyperlink-> page -word-> subdue]
/* Vertex ID Label */
s
v 1 _page_
v 2 _page_
v 3 subdue
/* Edge Vertex 1 Vertex 2 Label */
d 1 2 _hyperlink_
d 2 3 _word_
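A query in this `v` / `d` format can be matched against a domain graph with a small exact matcher. The sketch below is a brute-force illustration, not Subdue's matcher: it assumes single-token labels, ignores every line that does not start with `v` or `d`, and tries every assignment of query vertices to domain vertices, so it only suits tiny graphs. Function names are hypothetical.

```python
from itertools import permutations


def parse_graph(text):
    """Read 'v <id> <label>' and 'd <v1> <v2> <label>' lines; skip everything else."""
    vertices, edges = {}, set()
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            vertices[int(parts[1])] = parts[2]
        elif parts[0] == "d":
            edges.add((int(parts[1]), int(parts[2]), parts[3]))
    return vertices, edges


def find_matches(query, domain):
    """Return every mapping of query vertices onto domain vertices that fits exactly."""
    (qv, qe), (dv, de) = query, domain
    q_ids = sorted(qv)
    matches = []
    for targets in permutations(sorted(dv), len(q_ids)):
        m = dict(zip(q_ids, targets))
        labels_ok = all(qv[q] == dv[m[q]] for q in q_ids)
        edges_ok = all((m[a], m[b], label) in de for a, b, label in qe)
        if labels_ok and edges_ok:
            matches.append(m)
    return matches
```

With the query from this slide, the pages that link to a page containing the word are the domain vertices assigned to query vertex 1 in each match.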

44 Search for Presentation Pages
- Subdue: 22 instances
- AltaVista: Query “host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif.”: 12 instances
[Figure: query graph with labels page, hyperlink]

45 Search for Reference Pages
- Search for page with at least 35 in links
- 5 pages in www-cse
- AltaVista cannot perform this type of search
[Figure: query graph with labels page, hyperlink, …]
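The talk poses this search as a graph query. As a plainly different and simpler technique, the same pages can be found by counting incoming hyperlinks directly; the sketch below assumes hyperlinks are given as `(source, target)` pairs, and the function name is hypothetical.

```python
from collections import Counter


def pages_with_min_inlinks(hyperlinks, minimum):
    """Return pages that receive at least `minimum` distinct incoming links."""
    # Deduplicate so repeated links from the same source count once.
    indegree = Counter(dst for _src, dst in set(hyperlinks))
    return sorted(page for page, n in indegree.items() if n >= minimum)
```

Calling it with `minimum=35` corresponds to the threshold used on this slide.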

46 Search for pages on ‘jobs in computer science’
- Inexact match: allow one level of synonyms
- Subdue found 33 matches
- Words include employment, work, job, problem, task
- AltaVista found 2 matches
[Figure: query graph with labels page, word, jobs, computer, science]

47 Search for ‘authority’ hub and authority pages
- Subdue found 3 hub (and 3 authority) pages
- AltaVista cannot perform this type of search
- Inexact match applied with threshold = 0.2 (4.2 transformations allowed)
- Subdue found 13 matches
[Figure: query graph with labels page, hyperlink, page, word, algorithms; groups HUBS and AUTHORITIES]

48 Subdue Learning from Web Data
- Distinguish professors’ and students’ web pages
  - Learned concept (professors have “box” in address field)
- Distinguish online stores and professors’ web pages
  - Learned concept (stores have more levels in graph)
[Figure: learned concept graphs with labels page, word, box]

49 To Learn More cygnus.uta.edu/subdue cook@cse.uta.edu http://www-cse.uta.edu/~cook

