March, 2011 I3.1 Noise-Aware Data Mining in Information Networks Xifeng Yan University of California at Santa Barbara INARC.

INARC PI Report
Involvement
  – I2.1: In-Network Storage
  – I2.2: Large-Scale Information Network Processing
  – I3.1: QoI Mining of Noisy, Volatile, Uncertain, and Incomplete Heterogeneous Information Networks
  – I3.2: Modeling and Mining of Text-Rich Information Networks
  – E1.2: Composite Network Modeling with Composite Graphs
Objective
  – Concepts, Models, Theories, Methods, and Systems for measuring and operating Information Networks and Others
  – Concepts and Models in Noise-aware data mining of information networks
Collaborators
  – Z. Wen (IBM/SCNARC), J. Bao (RPI/IRC), J. Han (UIUC/INARC), M. Srivatsa (IBM/INARC), V. Kawadia (BBN/IRC), S. Desai (Army)

I3.1: Noise-Aware Mining: Graph Iceberg
R1 has a high concentration of black vertices but low connectivity
R2, by contrast, has few black vertices but is well connected
R3 is an anomaly region with a high density of black vertices and high connectivity
Find abnormally high densities of intrusions in a network (targeted attack)
Find online communities where sensitive topics appear abnormally often (extremist groups)
Help us study why it happens
Nan Li, et al., Towards Iceberg Analysis in Graph OLAP, in preparation for VLDB Journal

I3.1: Noise-Aware Mining: Graph Iceberg
Huge search space
  – If we confine the size of the regions to be s, the total number of regions in a graph with n vertices is $O(n^s)$
Our method
  – Find promising vertices first
  – Cluster these vertices to find the communities
Promising vertices
  – Aggregate the personalized PageRank score of neighbors where the event takes place
  – High value => good vertices
Nan Li, et al., Towards Iceberg Analysis in Graph OLAP, in preparation for VLDB Journal
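A minimal sketch of the "promising vertices" step, assuming the aggregate score of a vertex is the total personalized PageRank mass it places on event (black) vertices. It relies on the standard fact that a random walk that terminates with probability alpha at each step ends at a vertex distributed according to the personalized PageRank vector of its start. The function name, `alpha`, and the walk count are illustrative assumptions, not the gIceberg implementation.

```python
import random


def aggregated_ppr(adj, black, v, alpha=0.15, walks=2000, rng=None):
    """Estimate the personalized PageRank mass that v assigns to black vertices.

    adj:   dict mapping each vertex to a list of its neighbours
    black: set of vertices where the event takes place
    """
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(walks):
        node = v
        # Continue walking with probability 1 - alpha, stop otherwise.
        while rng.random() >= alpha:
            neighbours = adj[node]
            if not neighbours:  # dangling vertex: the walk ends here
                break
            node = rng.choice(neighbours)
        hits += node in black
    return hits / walks


# Two triangles joined by the edge 2-3; the event occurs only in the first one.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
black = {0, 1, 2}
scores = {v: aggregated_ppr(adj, black, v) for v in adj}
promising = sorted(adj, key=scores.get, reverse=True)
```

Vertices inside the black triangle receive the highest scores, so a later clustering step only needs to look at the top of `promising`.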

gIceberg: PPV-Based Aggregation
Personalized PageRank vector (PPV) aggregation
  – Use PPV to measure the local closeness of two vertices
Local clustering algorithms
  – Query-aggregated personalized PageRank score (PPS)
Personalized PageRank approximation
  – Random-walk-based sampling
  – Pair-wise PPV formula
  – Active boundary

I3.1: Noise-Aware Mining: Graph Iceberg
Our model: aggregated personalized PageRank + sampling
  – 10-50 times faster
A novel graph mining framework
  – Finds anomaly regions in large heterogeneous information networks
  – Noise-aware mining: it is an aggregate measure, which can easily overcome noise
The first of its kind in network science
Nan Li, et al., Towards Iceberg Analysis in Graph OLAP, in preparation for VLDB Journal

I3.1: Noise-Aware Mining: Structural Correlation
A novel metric, Decayed Hitting Time, is proposed to assess and rank structural correlations
SIGMOD reviewer: “Interesting problem that I haven’t seen before”
The first of its kind defined for networks
Sampling algorithm: 10-20 times faster
An aggregate measure: noise-resistant
Question: Is the distribution of events (blue nodes) influenced by the network links or not? If it is, to what degree?
(UCSB) Z. Guan, J. Wu, Z. Yun, A. Singh, X. Yan, Assessing and Ranking Structural Correlations in Graphs, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011

What is structural correlation?
Real-world networks contain not only nodes and edges but also events (attributes)
  – Information network: events, documents, etc.
  – Social network: blog posts, rumors, opinions, online shopping, etc.
  – Virus/malware infections
Viruses propagate through computer networks, email networks, or Facebook. Which one is the main channel for a specific virus/malware?
Some events are correlated with network links, while others just occur randomly

Correlation Metric in Information Networks
Question: Is the distribution of events (blue nodes) influenced by the network links or not? If it is, to what degree?
Why measure such correlation?
  – Helps understand the distribution of events in networks
  – Helps detect viral influence in the underlying network
  – Correlation depends on link type, event type, and time

How to measure?
If correlated, black nodes tend to stick together.
A naïve approach: only look at the neighborhood
General idea: compute the aggregated proximity among black nodes, which will be noise-resistant

Measure definition
The measure
  – $V_q$: the set of nodes having event q; $s(\cdot)$ can be any graph proximity measure
We choose hitting time since it treats the target node set as a whole (compared to personalized PageRank, shortest distance, etc.)

Hitting time
The expected number of steps to reach a target node via random walk: $h(v_i, B) = \sum_{t=1}^{\infty} t \cdot \Pr(T_B = t \mid x_0 = v_i)$
  – B: target node set; $\Pr(T_B = t \mid x_0 = v_i)$: the probability that we start from $v_i$ and reach B after t steps
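To make the definition concrete, the sketch below computes expected hitting times on a small unweighted graph. It uses the standard recurrence (0 for target nodes, otherwise 1 plus the average over neighbors) and solves it by fixed-point iteration. The function name and the iteration limits are assumptions for illustration, and every non-target node is assumed to have at least one neighbor.

```python
def hitting_time(adj, targets, iters=10_000, tol=1e-12):
    """Expected number of random-walk steps from each node to the target set.

    adj:     dict mapping each node to a list of neighbours
    targets: set of target nodes B
    Solves h(v) = 0 for v in B, h(v) = 1 + mean(h(u) for u in adj[v]) otherwise.
    """
    h = {v: 0.0 for v in adj}
    for _ in range(iters):
        delta = 0.0
        for v in adj:
            if v in targets:
                continue
            new = 1.0 + sum(h[u] for u in adj[v]) / len(adj[v])
            delta = max(delta, abs(new - h[v]))
            h[v] = new
        if delta < tol:  # converged
            break
    return h


# Path graph 0 - 1 - 2 with target set {2}.
path = {0: [1], 1: [0, 2], 2: [1]}
h = hitting_time(path, {2})
```

On this path graph the recurrence gives h(1) = 3 and h(0) = 4. If some node cannot reach the target set, the iteration never converges, which mirrors the fact that the hitting time is infinite in that case.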

Decayed Hitting Time (DHT)
Hitting time can be infinite
To compute proximity better and faster, we propose using Decayed Hitting Time
  – Maps [1,∞) to [0,1]; a high value means high proximity
  – Emphasizes the importance of short paths and reduces the impact of long paths
  – Facilitates approximation of DHT
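The decay function itself is not spelled out here. One form consistent with the three properties listed, assumed purely for illustration, replaces the factor t in the expected hitting time with an exponentially decaying weight (the symbol $\tilde{h}$ is also an assumption):

```latex
\tilde{h}(v_i, B) \;=\; \sum_{t=1}^{\infty} e^{-(t-1)} \, \Pr(T_B = t \mid x_0 = v_i)
```

Under this assumed form a hit at t = 1 contributes 1, later hits contribute exponentially less, and a walk that never reaches B contributes 0, so the value always lies in [0, 1].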

DHT sampling approximation
Perform c simulated random walks from $v_i$
Two strategies:
  – In each random walk, stop when we hit a target node. This gives an estimate of the DHT, but a walk may never stop, and in a large graph it can be time-consuming
  – In each random walk, stop when we hit a target node or when the maximum number of steps (denoted by s) is reached. This gives an approximation to the DHT

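Bounds for Sampling Approximation
Suppose that, of the c random walks, some hit a target and the rest reach s steps without hitting one
For each random walk in the latter group, its contribution to the DHT estimate is upper bounded by the decayed value of a hit at step s+1 and lower bounded by 0
Together these give lower and upper bounds for the DHT estimate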
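The truncated sampling strategy and its bounds can be sketched as follows. The decay weight e^{-(t-1)} for a hit at step t is an assumption made for illustration, as are the function and parameter names. A walk cut off after s steps could hit a target no earlier than step s+1, so under this decay it contributes at most e^{-s} and at least 0.

```python
import math
import random


def dht_bounds(adj, targets, start, c=1000, s=10, rng=None):
    """Lower and upper bounds on a sampled decayed hitting time.

    Runs c random walks from `start`, each stopped at the first hit of
    `targets` or after s steps. Assumes `start` is not in `targets` and
    every node has at least one neighbour.
    """
    rng = rng or random.Random(0)
    total = 0.0      # summed decay weights of the walks that hit a target
    truncated = 0    # walks that reached s steps without a hit
    for _ in range(c):
        node = start
        for t in range(1, s + 1):
            node = rng.choice(adj[node])
            if node in targets:
                total += math.exp(-(t - 1))  # assumed decay weight
                break
        else:
            truncated += 1
    lower = total / c                                # truncated walks count as 0
    upper = lower + truncated * math.exp(-s) / c     # or as at most e^{-s} each
    return lower, upper


# Path graph 0 - 1 - 2, walking from node 1 towards target {2}.
path = {0: [1], 1: [0, 2], 2: [1]}
low, high = dht_bounds(path, {2}, start=1)
```

The gap between the two bounds shrinks exponentially in s, which is why a small step limit is usually enough in this sketch.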

From measure to significance
Consider a randomly selected set of m nodes
  – As m increases, m randomly selected nodes tend to be closer to one another (in fact, a monotonic increase of the expected measure can be proved)
  – Relying on ρ alone is not enough; we should assess how far ρ deviates from the random case
An approximation method for significance

Estimating the Expectation
Sampling: randomly sample c node sets of size m and estimate their ρ values (also by sampling). Then take the sample mean as an estimate of the expected ρ
An approximation method by geometric distribution
  – When generating a random node set, each node has probability m/n of being chosen
  – Relaxation: each node is chosen independently
  – Starting from a node, the probability that the random walk hits a target node after t steps is $p(1-p)^{t-1}$, where $p = m/n$. The expected DHT then follows from the definition of DHT
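As a worked version of the geometric relaxation, assume each step hits a target independently with probability p = m/n and that a hit at step t is weighted by the exponential decay e^{-(t-1)}. Both the decay form and the symbol $\tilde{h}$ are assumptions made for illustration. The expectation is then a geometric series:

```latex
\begin{align*}
\Pr(T = t) &= p\,(1-p)^{t-1}, \qquad p = \frac{m}{n} \\
\mathbb{E}[\tilde{h}] &= \sum_{t=1}^{\infty} e^{-(t-1)}\, p\,(1-p)^{t-1}
  = \frac{p}{1 - (1-p)/e}
  = \frac{p\,e}{e - 1 + p}
\end{align*}
```

Under these assumptions p = 0.01 gives roughly 0.0157, and the value rises monotonically towards 1 as p approaches 1, in line with the observation that larger random sets look closer together.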

Estimating the Variance
Also use sampling
  – Sample node sets of size m and estimate their ρ values. Then compute the sample variance
  – Since each term in the definition of ρ is assumed to be independent, the variance of ρ can be derived from the variance of individual DHT terms
  – Thus, we sample pairs, estimate their DHTs, and compute the sample variance
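The sampling route to significance can be sketched end to end. Several pieces here are assumptions for illustration: ρ is taken to be the average decayed hitting time from each event node to the remaining event nodes, the decay weight is e^{-(t-1)}, walks are truncated after s steps, and significance is expressed as a z-score against random node sets of the same size. All function names are hypothetical.

```python
import math
import random
import statistics


def dht(adj, targets, start, c, s, rng):
    """Truncated-sampling estimate of the (assumed) decayed hitting time."""
    total = 0.0
    for _ in range(c):
        node = start
        for t in range(1, s + 1):
            node = rng.choice(adj[node])
            if node in targets:
                total += math.exp(-(t - 1))
                break
    return total / c


def rho(adj, nodes, c=50, s=8, rng=None):
    """Assumed measure: mean DHT from each node to the rest of the set."""
    rng = rng or random.Random(0)
    nodes = set(nodes)
    return sum(dht(adj, nodes - {v}, v, c, s, rng) for v in nodes) / len(nodes)


def significance(adj, event_nodes, samples=30, rng=None):
    """z-score of rho(event_nodes) against random node sets of equal size."""
    rng = rng or random.Random(0)
    observed = rho(adj, event_nodes, rng=rng)
    population = sorted(adj)
    m = len(event_nodes)
    random_rhos = [rho(adj, rng.sample(population, m), rng=rng)
                   for _ in range(samples)]
    mean = statistics.mean(random_rhos)    # sample mean over random sets
    std = statistics.stdev(random_rhos)    # sample standard deviation
    return (observed - mean) / std if std > 0 else 0.0


# Ring of 60 nodes; the event occupies 6 consecutive nodes.
n = 60
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
z = significance(ring, set(range(6)))
```

A clustered event set on the ring scores far above random sets of the same size, so `z` comes out strongly positive, whereas a scattered set would score near zero.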

Experiments - Datasets
DBLP
  – Co-author network
  – Events: keywords in paper titles
  – 815,940 nodes, 2,857,960 edges, and 171,614 events
TaoBao
  – Online shopping data, friend network
  – Events: products
  – 794,001 nodes, 1,370,284 edges, 100 typical products
Twitter
  – 40 million nodes and 1.4 billion edges

Experiments - Efficiency

Experiments - Effectiveness (TaoBao)

Experiments – Correlation Evolution (TaoBao)

Collaborations
Collaborations with researchers in other networks
  – (E1.2) Work with Prithwish Basu (BBN) on Network Design
  – (I2.1) Work with Arun Iyengar and Mudhakar Srivatsa (IBM), who have done much work on DTN and storage, on building a connection between information network processing on DTN and on clusters. Shengqi Yang will work on it at IBM this summer.
  – (T2.3) Work with Vikas Kawadia (BBN) on using graph query processing for distributed trust computing. Ziyu Guan is collaborating with Vikas.
  – (S1.1) Work with Zhen Wen (IBM) on the social network application of graph density indexing
  – (E1.1) Work with Jie Bao (RPI) on RDF queries using neighborhood-based graph search
  – (I3.1, I3.2) Work with Jiawei Han (UIUC) on graph mining
  – Work with Sachi Desai (Army) on graph query language/system

Research Papers
  – A. Khan, N. Li, Z. Guan, X. Yan, S. Chakraborty, and S. Tao, Neighborhood Based Fast Graph Search in Large Networks, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011.
  – Z. Guan, J. Wu, Z. Yun, A. Singh, X. Yan, Assessing and Ranking Structural Correlations in Graphs, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011.
  – Qiang Qu, Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu, and Hongyan Li, Efficient Topological OLAP on Information Networks, DASFAA'11.
  – Yizhou Sun et al., PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks, submitted to VLDB 2011.
  – Nan Li, et al., Towards Iceberg Analysis in Graph OLAP, to be submitted to VLDB Journal.

Research Papers
  – Gengxin Miao, Ziyu Guan, Louise Moser, Xifeng Yan, Shu Tao, Nikos Anerousis, Latent Association Analysis of Document Pairs, submitted to SIGKDD 2011.
  – Ziyu Guan et al., Diffusion through Co-occurrence Relationships for Expert Search on the Web, submitted to SIGIR 2011.
  – Shengqi Yang, Bo Zong, Arijit Khan, Ben Zhao, Xifeng Yan, Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011.
  – Nan Li, Arijit Khan, Xifeng Yan, and Zhen Wen, Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal.
  – C. C. Aggarwal, A. Khan, X. Yan, A Probabilistic Index for Massive and Dynamic Graph Streams, Research Report, to be submitted to VLDB Journal.

Next Six Months and Path Ahead to 2012
Continue research
  – Composite Network Modeling and Design
  – Large-Scale Information Network Processing
  – Information Network in DTN
  – Information Network Mining and Measuring with Noise and Dynamics: Structural Correlation in Dynamic Situations; Mining Graph Patterns in a Noisy Environment; Node Mining and Inference for Multiple Information Networks (QoI)
  – Information Network Modeling with Text
  – Graph Query Language
  – Information Network Query Engine

Brief Summary of My Team’s Work in Other Tasks

I2.2: Graph Partition for Distributed Graph Computing
Are typical techniques efficient for graph queries?
Graph partitioning and distribution techniques (e.g., Pregel). Limitations:
  – Unbalanced workload due to skewed (non-uniformly distributed) graph queries
  – Communication overhead due to inter-machine (cross-partition) communication
Goal
  – Model-based graph partitioning techniques
  – First-of-its-kind publicly available distributed graph computing platform for information, social, and communication networks
Shengqi Yang, et al., Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011

I2.1: Adapt Sedge to DTN
Master
  – Vertex->Partition Map + Network Contact Graph + Route Table (opportunistic path)
Worker
  – Message Queue (Cache)
Superstep = Time slot
  – e.g., 1 min, 1 hour, 1 day, etc.
[Figure: Contact Graph and Cluster Connection, with nodes 1-6 and partitions P1-P6]

I2.2 gDensity: Model-Based Indexing
Problem definition (labeled proximity search)
  – Label-based graph proximity search seeks to find the top-k vertex subsets with the smallest diameters for a given query of distinct labels. Each subset must cover all the labels specified in the query.
10-300 times faster
Uses a probabilistic model to build the index
Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal
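The problem definition can be illustrated with a brute-force baseline. This is not the gDensity index, only a reference implementation of the query semantics under a simplifying assumption: candidate subsets are formed by picking one vertex per query label, and the diameter is the largest shortest-path distance within the subset. All names are hypothetical.

```python
from collections import deque
from itertools import product


def bfs_dist(adj, src):
    """Unweighted shortest-path distances from src to every reachable vertex."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def top_k_proximity(adj, labels, query, k=1):
    """Return the k (diameter, subset) pairs with the smallest diameters.

    labels: dict mapping each vertex to the set of labels it carries
    query:  list of distinct labels that every subset must cover
    """
    dist = {v: bfs_dist(adj, v) for v in adj}
    candidates = [[v for v in adj if lab in labels[v]] for lab in query]
    results = set()
    for combo in product(*candidates):  # one vertex per query label
        subset = frozenset(combo)
        diameter = max(
            (dist[a].get(b, float("inf")) for a in subset for b in subset),
            default=0,
        )
        results.add((diameter, tuple(sorted(subset))))
    return sorted(results)[:k]


# Path graph 0 - 1 - 2 - 3 - 4 with a few labelled vertices.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = {0: {"a"}, 1: {"b"}, 2: set(), 3: {"a"}, 4: {"c"}}
best_ab = top_k_proximity(adj, labels, ["a", "b"])
best_ac = top_k_proximity(adj, labels, ["a", "c"])
```

The enumeration grows exponentially with the number of query labels, which is the cost an index is meant to avoid.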

I2.2: Graph Search: the Model-Based Approach
Align two networks (LinkedIn, Facebook)
Ideas
  – Use an information propagation model to propagate labels in information networks
  – Convert vertices to vectors
  – Align sets of vectors
Query speed: 0.1 sec for WebGraph (10M vertices, 213M edges)
[Figure: Information Propagation Model]
A. Khan et al., Neighborhood Based Fast Graph Search in Large Networks, SIGMOD’11
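One way to picture "convert vertices to vectors" is sketched below. It assumes a simple propagation model in which each label within a fixed number of hops contributes a strength that decays geometrically with distance. The decay factor, hop limit, exclusion of the vertex's own labels, and function name are all assumptions for illustration, not the model from the paper.

```python
from collections import defaultdict, deque


def neighborhood_vector(adj, labels, v, hops=2, alpha=0.5):
    """Map vertex v to a {label: strength} vector of its neighbourhood.

    A label carried by a vertex at distance d (1 <= d <= hops) adds alpha**d.
    """
    vec = defaultdict(float)
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] > 0:  # the vertex's own labels are left out of its vector
            for lab in labels[u]:
                vec[lab] += alpha ** dist[u]
        if dist[u] < hops:
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
    return dict(vec)


# Path graph 0 - 1 - 2 with one label per vertex.
adj = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: ["x"], 1: ["y"], 2: ["x"]}
vec0 = neighborhood_vector(adj, labels, 0)
```

Two vertices from different networks can then be compared through their vectors instead of through exact subgraph matching.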

I3.2: Progressive Network Analysis for Expert Search
Goal: find and rank people who have the expertise described by a user query
Web pages are noisier and contain more spam than an enterprise corpus, so both relevance and reputation should be considered
Use a heterogeneous hypergraph to model the co-occurrence relationships among people and words, and devise a heat diffusion model on the hypergraph
Applied to 0.5B web pages
Accuracy: 50%-200% improvement over the leading language model methods; significantly overcomes noise on the Web
Ziyu Guan, et al., “Diffusion through Co-occurrence Relationships for Expert Search on the Web”, SIGIR’11 (sub)

I3.2: Latent Association Analysis of Document Pairs
Latent Association Analysis (LAA) mines the topics of two document sets simultaneously, taking the bipartite network between the two document sets into consideration
One of the first attempts to analyze the topic structures of two connected document sets, aiming to infer their mapping network model
LAA significantly outperforms existing algorithms with a 70% accuracy improvement
[Figure: Topic Simplex for Corpus 1, Topic Simplex for Corpus 2, Correlation Factor, Document Pairs]
Gengxin Miao, et al., “Latent Association Analysis of Document Pairs”, KDD’11 (sub)

E1.2: Collaborative Network Modeling And Inference
Questions:
1. How to model it?
2. How does information flow among different agents?
3. How do agents interact with each other?
4. How to measure the quality of the flow?
5. Is there any mis-interaction among these agents?
6. Can we identify the role of the agents?
7. Can we identify the relationship?
8. Can we identify the weak components?
