Data mining has emerged as a critical tool for knowledge discovery in large data sets. It has been extensively used to analyze business, financial, and textual data sets.
The success of these techniques has renewed interest in applying them to various scientific and engineering fields:
- Astronomy
- Life sciences
- Ecosystem modeling
- Structural mechanics
- …

Most existing data mining algorithms assume that the data is represented as one of:
- Transactions (sets of items)
- Sequences of items or events
- Multi-dimensional vectors
- Time series
Scientific datasets with structures, layers, hierarchy, geometry, and arbitrary relations cannot be accurately modeled using such frameworks, e.g., numerical simulations, 3D protein structures, chemical compounds, etc.
We need algorithms that operate on scientific datasets in their native representation.

There are two basic choices:
- Treat each dataset/application differently and develop custom representations/algorithms.
- Employ a new way of modeling such datasets and develop algorithms that span different applications!
What should be the properties of this general modeling framework?
- Abstract compared with the original raw data.
- Yet powerful enough to capture the important characteristics.
Labeled directed/undirected topological/geometric graphs and hypergraphs.

Graphs are suitable for capturing arbitrary relations between the various elements.
Data instance → Graph instance
Element → Vertex
Element's attributes → Vertex label
Relation between two elements → Edge
Type of relation → Edge label
Relation between a set of elements → Hyperedge
Graphs provide enormous flexibility for modeling the underlying data, as they allow the modeler to decide what the elements should be and which types of relations to model.
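As a minimal sketch of the mapping above, a data instance can be held in a small dictionary-based structure. The class name `LabeledGraph`, its method names, and the toy water-molecule instance are illustrative assumptions, not something taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """One data instance: labeled vertices, labeled undirected edges, optional hyperedges."""
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> element attributes
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> type of relation
    hyper_edges: list = field(default_factory=list)    # (frozenset of vertex ids, label)

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        if u == v:
            raise ValueError("self-loops are not supported in this sketch")
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("add both endpoints before adding the edge")
        self.edge_labels[frozenset((u, v))] = label

    def add_hyper_edge(self, vertices, label):
        # A hyperedge relates a whole set of elements at once.
        self.hyper_edges.append((frozenset(vertices), label))

    def neighbors(self, v):
        # For each edge touching v, pick the other endpoint.
        return sorted(next(iter(e - {v})) for e in self.edge_labels if v in e)


# Toy instance (hypothetical): atoms are elements, bonds are relations.
water = LabeledGraph()
water.add_vertex(0, "O")
water.add_vertex(1, "H")
water.add_vertex(2, "H")
water.add_edge(0, 1, "single")
water.add_edge(0, 2, "single")
water.add_hyper_edge({0, 1, 2}, "molecule")

print(water.neighbors(0))
```

What counts as a vertex here (an atom) is purely a modeling decision; the same structure could just as well hold residues of a protein or cells of a simulation mesh.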

PDB: 1MWP. N-terminal domain of the amyloid precursor protein (Alzheimer's disease amyloid A4 protein precursor)

Develop algorithms to mine and analyze graph data sets:
- Finding patterns in these graphs
- Finding groups of similar graphs (clustering)
- Building predictive models for the graphs (classification)

- Structural motif discovery
- High-throughput screening
- Protein fold recognition
- VLSI reverse engineering
- A lot more …
Beyond scientific applications:
- Semantic web
- Mining relational profiles
- Behavioral modeling
- Intrusion detection
- Citation analysis
- …

Approach #1: Frequent Subgraph Mining
Find all subgraphs g within a set of graph transactions G such that freq(g) / |G| ≥ t, where freq(g) is the number of transactions in G that contain g and t is the minimum support.
Focus on pruning and fast, code-based graph matching.
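The support test can be sketched with a brute-force matcher. This is only an illustration under stated assumptions: support is taken to be the fraction of graph transactions that contain the pattern, the matcher tries every injective vertex mapping (none of the pruning or code-based matching the slide refers to), and the function names and toy transactions are hypothetical.

```python
from itertools import permutations

_MISSING = object()  # sentinel so that an absent edge never equals a real label


def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a (not necessarily induced) subgraph.

    Both arguments are (vertex_labels, edges) pairs: vertex_labels is a list of
    labels indexed by vertex id, edges maps frozenset({u, v}) to an edge label.
    """
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    # Try every injective mapping of pattern vertices onto graph vertices.
    for image in permutations(range(len(g_labels)), len(p_labels)):
        if any(p_labels[i] != g_labels[image[i]] for i in range(len(p_labels))):
            continue
        if all(
            g_edges.get(frozenset(image[v] for v in edge), _MISSING) == label
            for edge, label in p_edges.items()
        ):
            return True
    return False


def support(pattern, transactions):
    """Fraction of graph transactions that contain the pattern."""
    return sum(contains(g, pattern) for g in transactions) / len(transactions)


def is_frequent(pattern, transactions, t):
    return support(pattern, transactions) >= t


# Hypothetical toy transactions.
transactions = [
    (["C", "C", "O"], {frozenset((0, 1)): "single", frozenset((1, 2)): "double"}),
    (["C", "O"], {frozenset((0, 1)): "double"}),
    (["C", "C"], {frozenset((0, 1)): "single"}),
]
carbonyl = (["C", "O"], {frozenset((0, 1)): "double"})

print(is_frequent(carbonyl, transactions, t=0.5))  # True: 2 of 3 transactions contain it
```

The exhaustive mapping loop is exponential in the pattern size, which is exactly why practical miners invest in pruning and fast matching.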

Approach #1: Algorithms
- Apriori-based Graph Mining (AGM): Inokuchi, Washio & Motoda (Osaka U., Japan)
- Frequent Sub-Graph discovery (FSG): Kuramochi & Karypis (U. Minnesota)
- Graph-based Substructure pattern mining (gSpan): Yan & Han (UIUC)
- Fast Frequent Subgraph Mining (FFSM), Spanning tree based maximal graph mining (Spin): Huan, Wang & Prins (UNC Chapel Hill)
- Graph, Sequences and Tree extraction (Gaston): Kazius & Nijssen (U. Leiden, Netherlands)

A pattern is a relation between the object's elements that recurs over and over again.
- Common structures in a family of chemical compounds or proteins.
- Similar arrangements of vortices at different “instances” of turbulent fluid flows.
- …
There are two general ways to formally define a pattern in the context of graphs:
- Arbitrary subgraphs (connected or not)
- Induced subgraphs (connected or not)
Frequent pattern discovery translates to frequent subgraph discovery…
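The difference between the two pattern definitions can be shown on a triangle. An arbitrary subgraph may keep any subset of the host edges among its chosen vertices, while an induced subgraph must keep all of them. The helper names below are illustrative, and vertex and edge labels are left out to keep the sketch short.

```python
def induced_edges(host_edges, vertices):
    """All host edges whose endpoints both lie in `vertices`."""
    chosen = set(vertices)
    return {e for e in host_edges if e <= chosen}


def is_subgraph(host_edges, vertices, edges):
    """Arbitrary subgraph: any subset of the host edges among the chosen vertices."""
    return set(edges) <= induced_edges(host_edges, vertices)


def is_induced_subgraph(host_edges, vertices, edges):
    """Induced subgraph: exactly the host edges among the chosen vertices."""
    return set(edges) == induced_edges(host_edges, vertices)


triangle = {frozenset(p) for p in [(0, 1), (1, 2), (0, 2)]}
path = {frozenset(p) for p in [(0, 1), (1, 2)]}

print(is_subgraph(triangle, {0, 1, 2}, path))          # True
print(is_induced_subgraph(triangle, {0, 1, 2}, path))  # False: edge 0-2 is missing
```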

- Candidate generation
- Candidate pruning
- Frequency counting
Key to FSG's computational efficiency.
Simple operations become complicated and expensive when dealing with graphs…

Multiple candidates for the same core!

Multiple cores between two (k-1)-subgraphs

[Figure: graph with vertices v0, v1, v2, v3 labeled B and v4, v5 labeled A, shown with two labels: Label = “1 01 011 0001 00010” and Label = “1 11 100 1000 01000”]
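The two label strings in the figure read like the columns above the diagonal of an adjacency matrix, written one column at a time (lengths 1 to 5 for six vertices). Assuming that reading, the sketch below builds such a code for a given vertex order and picks one representative over all orders by brute force. Taking the largest (vertex labels, code) pair as the canonical one is an assumption made for illustration, not necessarily the rule FSG uses.

```python
from itertools import permutations


def matrix_code(order, edges):
    """Concatenate the above-diagonal adjacency-matrix columns for one vertex order."""
    columns = []
    for j in range(1, len(order)):
        column = "".join(
            "1" if frozenset((order[i], order[j])) in edges else "0"
            for i in range(j)
        )
        columns.append(column)
    return " ".join(columns)


def canonical_label(vertex_labels, edges):
    """Largest (vertex-label tuple, matrix code) pair over all vertex orders."""
    best = None
    for order in permutations(range(len(vertex_labels))):
        key = (tuple(vertex_labels[v] for v in order), matrix_code(order, edges))
        if best is None or key > best:
            best = key
    return best


# Two numberings of the same path A-B-B, plus a different graph B-A-B.
path_1 = (["A", "B", "B"], {frozenset((0, 1)), frozenset((1, 2))})
path_2 = (["B", "A", "B"], {frozenset((1, 0)), frozenset((0, 2))})
star = (["A", "B", "B"], {frozenset((0, 1)), frozenset((0, 2))})

print(canonical_label(*path_1) == canonical_label(*path_2))  # True
print(canonical_label(*path_1) == canonical_label(*star))    # False
```

Enumerating every order costs n! matrix codes, so this only works for tiny graphs; it is meant to show why one graph can carry several labels and why a single canonical one is needed for comparison.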

Graph databases →
1. Discover frequent subgraphs
2. Select discriminating features
3. Transform graphs into feature representation
4. Learn a classification model
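Steps 1 to 3 of this pipeline can be sketched in a heavily simplified form. The assumptions are loud ones: patterns are restricted to single labeled edges rather than general subgraphs, "discriminating" is taken to mean a large gap in support between two classes, and all function names and the toy data are hypothetical. Step 4 is left to any standard vector classifier.

```python
from collections import Counter


def edge_patterns(graph):
    """Single-edge patterns of a graph as (label, edge label, label) triples."""
    labels, edges = graph
    found = set()
    for (u, v), edge_label in edges.items():
        a, b = sorted((labels[u], labels[v]))
        found.add((a, edge_label, b))
    return found


def frequent_patterns(graphs, min_support):
    """Step 1 (simplified): patterns present in at least min_support of the graphs."""
    counts = Counter(p for g in graphs for p in edge_patterns(g))
    return sorted(p for p, c in counts.items() if c / len(graphs) >= min_support)


def discriminating(patterns, graphs, classes, min_gap):
    """Step 2 (assumed criterion): support differs by at least min_gap between classes."""
    pos = [g for g, c in zip(graphs, classes) if c == 1]
    neg = [g for g, c in zip(graphs, classes) if c == 0]
    keep = []
    for p in patterns:
        sup_pos = sum(p in edge_patterns(g) for g in pos) / len(pos)
        sup_neg = sum(p in edge_patterns(g) for g in neg) / len(neg)
        if abs(sup_pos - sup_neg) >= min_gap:
            keep.append(p)
    return keep


def to_vector(graph, features):
    """Step 3: binary vector marking which selected patterns the graph contains."""
    present = edge_patterns(graph)
    return [1 if f in present else 0 for f in features]


# Hypothetical graph database with class labels.
graphs = [
    ({0: "C", 1: "O"}, {(0, 1): "double"}),
    ({0: "C", 1: "C", 2: "O"}, {(0, 1): "single", (1, 2): "double"}),
    ({0: "C", 1: "C"}, {(0, 1): "single"}),
    ({0: "C", 1: "N"}, {(0, 1): "single"}),
]
classes = [1, 1, 0, 0]

frequent = frequent_patterns(graphs, min_support=0.5)
features = discriminating(frequent, graphs, classes, min_gap=0.75)
vectors = [to_vector(g, features) for g in graphs]
print(features)
print(vectors)
```

Once every graph is a fixed-length vector, the classification problem no longer depends on graph algorithms at all, which is the point of step 3.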

Approach #2: Find the subgraph S within a set of one or more graphs G that maximally compresses G, where (G|S) is G compressed by S, i.e., with each instance of S in G replaced by a single vertex.
Focus on efficient subgraph generation and heuristic search.
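The compression step can be sketched as follows. The size measure used here (number of vertices plus number of edges) is an assumption chosen for simplicity, since the measure on the slide did not survive in this transcript. Instances of S are passed in as given vertex sets, so the search for them is not shown, and edge labels are ignored.

```python
def size(vertex_labels, edges):
    """Assumed size measure: number of vertices plus number of edges."""
    return len(vertex_labels) + len(edges)


def compress(vertex_labels, edges, instances, name="S"):
    """Build (G|S): replace each instance (a set of vertex ids) by one new vertex."""
    owner = {}
    for k, instance in enumerate(instances):
        for v in instance:
            owner[v] = (name, k)
    new_labels = {v: lab for v, lab in vertex_labels.items() if v not in owner}
    for k in range(len(instances)):
        new_labels[(name, k)] = name
    new_edges = set()
    for u, v in edges:
        cu, cv = owner.get(u, u), owner.get(v, v)
        if cu != cv:  # edges inside an instance disappear
            new_edges.add(frozenset((cu, cv)))  # parallel edges collapse into one
    return new_labels, new_edges


# Hypothetical graph: two A-B-C triangles joined through a vertex labeled D.
g_labels = {0: "A", 1: "B", 2: "C", 3: "A", 4: "B", 5: "C", 6: "D"}
g_edges = {(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 6), (5, 6)}

s_labels = {0: "A", 1: "B", 2: "C"}
s_edges = {(0, 1), (1, 2), (0, 2)}

c_labels, c_edges = compress(g_labels, g_edges, [{0, 1, 2}, {3, 4, 5}])

print(size(g_labels, g_edges))                             # 15
print(size(s_labels, s_edges) + size(c_labels, c_edges))   # 11
```

Under this measure the triangle pays for itself: storing S once plus the compressed graph is smaller than storing G directly, so S would score well as a candidate substructure.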

THE BASIC IDEA BEHIND THE GBI

PAIRWISE CHUNKING

- Graphs provide a powerful mechanism to represent relational and physical datasets.
- They can be used as a quick prototyping tool to test whether data-mining techniques can help a particular application area and problem.
- Their benefits can be realized only if there exists an extensive set of efficient and scalable algorithms to mine them…

Takashi Matsuda, Hiroshi Motoda, Takashi Washio, "Graph-based induction and its applications," Advanced Engineering Informatics, Volume 16, Issue 2, April 2002, pp. 135-143.
Michihiro Kuramochi, George Karypis, "Frequent Subgraph Discovery," First IEEE International Conference on Data Mining (ICDM'01), 2001, pp. 313.
