Frequent Subgraph Pattern Mining on Uncertain Graph Data


Frequent Subgraph Pattern Mining on Uncertain Graph Data. Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang. Harbin Institute of Technology, China. CIKM’09, Hong Kong, Nov 4, 2009.

Outline: Background; Problem Definition; Algorithm; Experimental Results; Conclusions.

Background. Graph mining plays an important role in a range of real-world applications. Medicine: structures of molecules. Bioinformatics: biological networks. Technology: the WWW. Social science: social networks. And many others.

Directions of Graph Mining. Models of graphs, e.g., [Leskovec et al. KDD’05]. Patterns of graphs, e.g., [Yan et al. ICDM’02]. Uncertainties of graphs. Privacy of graphs, e.g., [Zou et al. VLDB’09]. Evolution of graphs, e.g., [Faloutsos et al. SIGMOD’07].

Uncertainties of Graphs: Example I. Protein-Protein Interaction (PPI) networks. Vertices: proteins. Edges: interactions between proteins. Uncertainties: the probabilities that the interactions really exist. [Figure: a PPI network over the proteins TIF34, FET3, NTG1, SMT3, RAD59 and RPC40, with edges labeled by the probabilities 0.375, 0.639, 0.867, 0.651, 0.651, 0.698, 0.147 and 0.639.] The data are taken from the STRING Database (http://string-db.org).

Uncertainties of Graphs: Example II. Topologies of wireless sensor networks (WSNs). Vertices: sensor nodes. Edges: wireless links between sensor nodes. Uncertainties: the probabilities that the wireless links are functioning at any given time. [Figure: a sensor network whose links are labeled by the probabilities 0.75, 0.95, 0.88, 0.92 and 0.69.]

The Goal of This Paper. Models of graphs, e.g., [Leskovec et al. KDD’05]. Patterns of graphs, e.g., [Yan et al. ICDM’02]. Uncertainties of graphs (the direction this paper addresses). Privacy of graphs, e.g., [Zou et al. VLDB’09]. Evolution of graphs, e.g., [Faloutsos et al. SIGMOD’07].

Preliminaries. [Figure: a graph database containing two graphs, G1 and G2, and example subgraph patterns with support = 1.0 and support = 0.5.] The support of a subgraph pattern S = (the number of graphs containing S) / (the total number of graphs).

Frequent Subgraph Pattern Mining Problem. Input: a graph database D and a support threshold minsup. Output: all subgraph patterns with support no less than minsup. FSP mining on biological networks (e.g., PPI networks) is an important tool for discovering functional modules [Koyutürk et al. Bioinformatics 04, Turanalp et al. BMC Bioinformatics 08]. PPI networks are subject to uncertainties. How do we define support?

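Model of Uncertain Graphs. [Figure: an uncertain graph whose four edges have existence probabilities 0.5, 0.6, 0.7 and 0.8, together with two of its implicated graphs.] The uncertain graph exists in the form of the first implicated graph with probability (1 - 0.5) * 0.6 * 0.7 * 0.8 = 0.168, and in the form of the second with probability 0.5 * (1 - 0.6) * 0.7 * 0.8 = 0.112.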

Model of Uncertain Graphs (Cont’d). Theorem: An uncertain graph represents a probability distribution over all its implicated graphs.
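The theorem can be checked by brute force on the four-edge example. The sketch below is illustrative only: it assumes edges exist independently (as the products on the previous slide suggest), and the edge names e1 to e4 are hypothetical labels that do not appear in the slides.

```python
from itertools import product

# Edge existence probabilities of the four-edge uncertain graph.
# The names e1..e4 are hypothetical labels introduced for this sketch.
EDGE_PROBS = {"e1": 0.5, "e2": 0.6, "e3": 0.7, "e4": 0.8}


def implicated_graphs(edge_probs):
    """Yield (set of present edges, probability) for every implicated graph.

    Assumes every edge exists independently of the others.
    """
    edges = list(edge_probs)
    for mask in product([True, False], repeat=len(edges)):
        prob = 1.0
        for edge, present in zip(edges, mask):
            p = edge_probs[edge]
            prob *= p if present else 1.0 - p
        yield frozenset(e for e, present in zip(edges, mask) if present), prob


# Map each implicated graph (identified by its edge set) to its probability.
distribution = dict(implicated_graphs(EDGE_PROBS))

# The probabilities over all 2^4 implicated graphs add up to 1.
total_probability = sum(distribution.values())
```

Under these assumptions the implicated graph that drops only the 0.5 edge gets probability 0.168 and the one that drops only the 0.6 edge gets 0.112, matching the two worked products.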

Uncertain Graph Databases. [Figure: an uncertain graph database consisting of two uncertain graphs, G1 with four edges and G2 with three edges, and one implicated graph of each, which together form an implicated graph database.] Theorem: An uncertain graph DB represents a probability distribution over all its implicated graph DBs. In total there are 2^4 * 2^3 = 128 implicated graph databases. The implicated graph database shown has probability ((1 - 0.5) * 0.6 * 0.7 * 0.8) * (0.8 * 0.1 * (1 - 0.7)) = 4.032 * 10^-3.
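The count and the probability on this slide can be reproduced with a few lines of Python. This is a minimal sketch assuming independent edges and independent graphs; the edge probabilities of G2 (0.8, 0.1, 0.7) are read off the product on the slide, and the function names are hypothetical.

```python
import math


def implicated_db_count(edge_counts):
    """Number of implicated databases: each edge is either present or absent."""
    return math.prod(2 ** m for m in edge_counts)


def implicated_db_probability(graphs):
    """Probability of one implicated database.

    graphs is a list with one entry per uncertain graph; each entry is a list
    of (edge_probability, present) pairs describing the chosen implicated graph.
    """
    prob = 1.0
    for edges in graphs:
        for p, present in edges:
            prob *= p if present else 1.0 - p
    return prob


# Implicated graph of G1: the 0.5 edge is absent, the other three are present.
g1_choice = [(0.5, False), (0.6, True), (0.7, True), (0.8, True)]
# Implicated graph of G2: the 0.7 edge is absent, the other two are present.
g2_choice = [(0.8, True), (0.1, True), (0.7, False)]

count = implicated_db_count([4, 3])
probability = implicated_db_probability([g1_choice, g2_choice])
```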

Expected Support. An uncertain graph DB D implicates the graph databases d1, d2, ..., dn, where p1 = Pr(D implicates d1), p2 = Pr(D implicates d2), ..., pn = Pr(D implicates dn), and s1 = support of S in d1, s2 = support of S in d2, ..., sn = support of S in dn. The expected support of S is $\mathrm{esup}(S) = \sum_{i=1}^{n} p_i s_i$.
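To make the definition concrete, the following sketch computes the expected support by enumerating every implicated database. It is a simplification under stated assumptions: edges are independent, and each graph comes with a precomputed list of embeddings of S (sets of edge names), so "d contains S" reduces to "some embedding has all its edges present". The toy database and all names are hypothetical.

```python
from itertools import product


def worlds(edge_probs):
    """Yield (present edge set, probability) for each implicated graph."""
    edges = list(edge_probs)
    for mask in product([True, False], repeat=len(edges)):
        present = frozenset(e for e, keep in zip(edges, mask) if keep)
        prob = 1.0
        for e, keep in zip(edges, mask):
            prob *= edge_probs[e] if keep else 1.0 - edge_probs[e]
        yield present, prob


def expected_support(database):
    """Sum p_i * s_i over all implicated databases d_i.

    database is a list of (edge_probs, embeddings) pairs, one per uncertain graph.
    """
    per_graph = [list(worlds(edge_probs)) for edge_probs, _ in database]
    total = 0.0
    for combo in product(*per_graph):
        p_i = 1.0   # probability that D implicates this database
        hits = 0    # number of graphs in this database that contain S
        for (present, prob), (_, embeddings) in zip(combo, database):
            p_i *= prob
            if any(emb <= present for emb in embeddings):
                hits += 1
        s_i = hits / len(database)
        total += p_i * s_i
    return total


# Hypothetical toy database: S needs edges a and b in the first graph,
# and edge c in the second graph.
TOY_DB = [
    ({"a": 0.5, "b": 0.6}, [{"a", "b"}]),
    ({"c": 0.8}, [{"c"}]),
]
toy_esup = expected_support(TOY_DB)
```

For the toy database, S occurs in the first graph with probability 0.5 * 0.6 = 0.3 and in the second with probability 0.8, so the enumeration yields 0.55. The enumeration is exponential in the number of edges, which is why it only serves as an illustration.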

FSP Mining Problem on Uncertain Graphs. Input: an uncertain graph database D and an expected support threshold minsup. Output: all subgraph patterns with expected support no less than minsup. It is #P-hard to count the number of frequent subgraph patterns; the reduction is from the problem of counting the satisfying truth assignments of a monotone k-CNF formula. The FSP mining problem on uncertain graphs is NP-hard.

Approximation Method. It is #P-hard to compute the expected support of a subgraph pattern. We develop an approximation method to find an approximate set of frequent subgraph patterns. Let e (0 < e < 1) be a relative error tolerance. [Figure: the expected-support axis from 0 to 1, marked at (1-e) minsup and minsup and divided into three regions labeled Discard, Arbitrary and Output.]

Objective I. Difficulty I: The number of frequent subgraph patterns is exponentially large. Objective I: Examine subgraph patterns as efficiently as possible to find all frequent ones.

Method for Objective I. Step 1: Build a search tree T of subgraph patterns. Step 2: Examine the subgraph patterns in T in depth-first order. If S is infrequent, then all its descendants can be pruned. [Figure: the uncertain graphs G1 and G2 from the earlier example, and the expected-support axis from 0 to 1, marked at (1-e) minsup and minsup, with regions Discard, Arbitrary and Output.]
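The two steps can be sketched as a generic depth-first traversal with pruning. This is not the authors' implementation: the search tree, the `children` callback and the exact `esup` oracle are hypothetical stand-ins (the actual algorithm works with approximate expected supports), and the toy values are made up for illustration. The pruning relies on expected support never increasing when a pattern is extended.

```python
def mine_frequent(roots, children, esup, minsup):
    """Depth-first search over a pattern tree, pruning below infrequent patterns."""
    frequent = []
    stack = list(reversed(roots))
    while stack:
        pattern = stack.pop()
        if esup(pattern) < minsup:
            continue  # infrequent: none of its descendants is examined
        frequent.append(pattern)
        # Push children in reverse so they are visited left to right.
        stack.extend(reversed(children(pattern)))
    return frequent


# Hypothetical toy search tree and expected supports (not from the slides).
TREE = {"A": ["AB", "AC"], "AB": ["ABC"], "B": ["BC"]}
ESUP = {"A": 0.9, "AB": 0.6, "ABC": 0.2, "AC": 0.3, "B": 0.4, "BC": 0.35}
evaluated = []  # records which patterns had their expected support computed


def toy_esup(pattern):
    evaluated.append(pattern)
    return ESUP[pattern]


result = mine_frequent(["A", "B"], lambda p: TREE.get(p, []), toy_esup, minsup=0.5)
```

In the toy run, B is infrequent, so its child BC is never evaluated; that saved evaluation is the point of the pruning rule.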

Objective II. Difficulty II: It is #P-hard to compute the expected support esup(S) of a subgraph pattern S. Objective II: Make the following judgments without computing esup(S) exactly. If esup(S) is surely not in the green (Output) region, then discard S. If esup(S) is probably in the green region and surely not in the red (Discard) region, then output S. [Figure: the expected-support axis from 0 to 1, marked at (1-e) minsup and minsup, with regions Discard, Arbitrary and Output.]

Method for Objective II. Step 1: Approximate esup(S) by an interval [l, u] such that esup(S) ∈ [l, u]. Step 2: Decide whether S can be output or not by testing the following conditions. [Figure: the expected-support axis from 0 to 1, marked at (1-e) minsup and minsup, with the three outcomes Output, Discard and Shrink.]
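The exact conditions on this slide did not survive in the transcript. The sketch below shows one set of tests that is consistent with the three outcomes named in the figure (Output, Discard, Shrink) and with the Discard / Arbitrary / Output regions; it is an assumption, not necessarily the authors' rule, and the function name `decide` is hypothetical.

```python
def decide(l, u, minsup, e):
    """Classify a pattern from an interval [l, u] known to contain esup(S).

    Assumed rule (not taken from the slides):
      - u < minsup:           esup(S) < minsup, so discarding S is allowed.
      - l >= (1 - e)*minsup:  esup(S) is not in the Discard region, so
                              outputting S is allowed.
      - otherwise:            the interval straddles both thresholds and
                              must be shrunk before deciding.
    When both of the first two tests hold, either answer is acceptable;
    this sketch discards.
    """
    if u < minsup:
        return "discard"
    if l >= (1.0 - e) * minsup:
        return "output"
    return "shrink"


examples = [
    decide(0.10, 0.20, minsup=0.5, e=0.1),
    decide(0.46, 0.60, minsup=0.5, e=0.1),
    decide(0.30, 0.60, minsup=0.5, e=0.1),
]
```

Note that under this rule an interval of width at most e * minsup never needs shrinking, since u >= minsup then forces l >= (1 - e) * minsup.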

Approximating esup(S) by [l, u]. A subgraph pattern S occurs in an uncertain graph G if S is contained in at least one implicated graph of G. Algorithm: Approximate esup(S) by [l, u]. Step 1: For each uncertain graph Gi in D, approximate Pr(S occurs in Gi) by an interval [li, ui] of width at most e*minsup. Step 2:
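The formula for Step 2 is missing from the transcript. One natural way to combine the per-graph intervals, offered here as an assumption rather than as the authors' formula, follows from linearity of expectation. Write m = |D| for the number of uncertain graphs and read Pr(S occurs in G_i) as the probability that the implicated graph drawn from G_i contains S:

```latex
\mathrm{esup}(S)
  = \mathbb{E}\!\left[\frac{1}{m}\sum_{i=1}^{m}
      \mathbf{1}\{S \text{ occurs in } G_i\}\right]
  = \frac{1}{m}\sum_{i=1}^{m} \Pr(S \text{ occurs in } G_i).

\text{If } l_i \le \Pr(S \text{ occurs in } G_i) \le u_i \text{ for all } i,
\text{ then with }
l = \frac{1}{m}\sum_{i=1}^{m} l_i, \qquad
u = \frac{1}{m}\sum_{i=1}^{m} u_i
\text{ we get } l \le \mathrm{esup}(S) \le u,

u - l = \frac{1}{m}\sum_{i=1}^{m} (u_i - l_i) \le e \cdot minsup .
```

Under this choice the averaged interval inherits the width bound of Step 1, so it is narrow enough to separate the Discard and Output regions.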

Approximate Pr(S occurs in Gi) by [li, ui]. [Figure: an uncertain graph Gi whose four edges have probabilities 0.5, 0.6, 0.7 and 0.8 and are labeled x1, x2, x3 and x4, and a pattern S.]
Step 1: Find all embeddings of S in Gi (here, 4 embeddings).
Step 2: Assign boolean variables to the edges in the embeddings. Pr(x1) = 0.5, Pr(x2) = 0.6, Pr(x3) = 0.7, Pr(x4) = 0.8.
Step 3: Construct a conjunctive formula for each embedding. C1 = (x1 ∧ x2), C2 = (x1 ∧ x4), C3 = (x2 ∧ x3), C4 = (x3 ∧ x4).
Step 4: Construct a DNF formula. F = C1 ∨ C2 ∨ C3 ∨ C4.
Step 5: Estimate Pr(F = TRUE) by p using Karp & Luby’s Markov-Chain Monte-Carlo method with absolute error e*minsup/2 and confidence d (d ∈ [0, 1]).
Step 6: [li, ui] = [p - e*minsup/2, p + e*minsup/2].
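Steps 3 to 5 can be illustrated with a small Monte Carlo estimator for the probability of a monotone DNF formula. The sketch uses the standard Karp-Luby coverage estimator with a fixed number of samples; it does not derive the sample count from e and d as the slide requires, it assumes independent edge variables, and the function names are hypothetical. An exact enumeration is included only to check the estimate on this four-variable example.

```python
import random
from itertools import product

# Edge variables and clauses from the example (one clause per embedding).
PROBS = {"x1": 0.5, "x2": 0.6, "x3": 0.7, "x4": 0.8}
CLAUSES = [{"x1", "x2"}, {"x1", "x4"}, {"x2", "x3"}, {"x3", "x4"}]


def clause_prob(clause, probs):
    """Probability that all variables of a conjunctive clause are true."""
    p = 1.0
    for var in clause:
        p *= probs[var]
    return p


def karp_luby(clauses, probs, samples, rng):
    """Estimate Pr(F = TRUE) for F = OR of monotone conjunctive clauses."""
    weights = [clause_prob(c, probs) for c in clauses]
    total = sum(weights)
    hits = 0
    for _ in range(samples):
        # Pick clause i with probability proportional to Pr(C_i).
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample an assignment conditioned on C_i being true.
        assignment = {
            v: (v in clauses[i]) or (rng.random() < p) for v, p in probs.items()
        }
        # Count the sample only if C_i is the first satisfied clause.
        first = next(
            j for j, c in enumerate(clauses) if all(assignment[v] for v in c)
        )
        if first == i:
            hits += 1
    return total * hits / samples


def exact_dnf_prob(clauses, probs):
    """Exact Pr(F = TRUE) by enumerating all assignments (tiny inputs only)."""
    names = list(probs)
    result = 0.0
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        weight = 1.0
        for v in names:
            weight *= probs[v] if assignment[v] else 1.0 - probs[v]
        if any(all(assignment[v] for v in c) for c in clauses):
            result += weight
    return result


estimate = karp_luby(CLAUSES, PROBS, samples=20000, rng=random.Random(0))
exact = exact_dnf_prob(CLAUSES, PROBS)
```

For this example F is equivalent to (x1 ∨ x3) ∧ (x2 ∨ x4), so the exact value is (1 - 0.5 * 0.3) * (1 - 0.4 * 0.2) = 0.782, and the estimate should land close to it.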

Experimental Results. Data: the STRING Database (http://string-db.org).

Time Efficiency

Approximation Quality

Scalability

Conclusions. A new model of uncertain graph data has been proposed. The frequent subgraph pattern mining problem on uncertain graph data has been formalized. The problem has been formally proved to be NP-hard. An approximate mining algorithm has been proposed. The proposed algorithm has high efficiency, high approximation quality, and high scalability.

Thank you