March, 2011 I3.1 Noise-Aware Data Mining in Information Networks Xifeng Yan University of California at Santa Barbara INARC.

INARC PI Report
• Involvement
  – I2.1: In-Network Storage
  – I2.2: Large-Scale Information Network Processing
  – I3.1: QoI Mining of Noisy, Volatile, Uncertain, and Incomplete Heterogeneous Information Networks
  – I3.2: Modeling and Mining of Text-Rich Information Networks
  – E1.2: Composite Network Modeling with Composite Graphs
• Objective
  – Concepts, models, theories, methods, and systems for measuring and operating information networks and others
  – Concepts and models in noise-aware data mining of information networks
• Collaborators
  – Z. Wen (IBM/SCNARC), J. Bao (RPI/IRC), J. Han (UIUC/INARC), M. Srivatsa (IBM/INARC), V. Kawadia (BBN/IRC), S. Desai (Army)

I3.1: Noise-Aware Mining: Graph Iceberg
• $R_1$ has a high concentration of black vertices but low connectivity
• $R_2$, by contrast, has few black vertices but is well connected
• $R_3$ is an anomaly region with both a high density of black vertices and high connectivity

• Find abnormally high densities of intrusions in a network (targeted attacks)
• Find online communities where sensitive topics appear abnormally often (extremist groups)
• Help us study why this happens
Nan Li, et al., Towards Iceberg Analysis in Graph OLAP, in preparation for VLDB Journal

I3.1: Noise-Aware Mining: Graph Iceberg
• Huge search space
  – If we confine the size of the regions to $s$, the total number of regions in a graph with $n$ vertices is $O(n^s)$
• Our method
  – Find promising vertices first
  – Cluster these vertices to find the communities
• Promising vertices
  – Aggregate the personalized PageRank scores of neighbors where the event takes place
  – High value => good vertices
Nan Li, et al., Towards Iceberg Analysis in Graph OLAP, in preparation for VLDB Journal

gIceberg: PPV-Based Aggregation
• Personalized PageRank vector (PPV) aggregation
  – Use PPV to measure the local closeness of two vertices
• Local clustering algorithms
  – Query-aggregated personalized PageRank score (PPS)
• Personalized PageRank approximation
  – Random-walk-based sampling
  – Pair-wise PPV formula
  – Active boundary
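
The random-walk sampling idea can be illustrated with a minimal sketch. It assumes an unweighted adjacency-list graph and a restart probability `alpha`; the function name, parameters, and the dead-end handling are hypothetical choices for illustration, not the gIceberg implementation. The end point of a walk that stops with probability `alpha` at each step is a sample from the personalized PageRank vector of the source, so the fraction of walks ending on event vertices estimates the aggregated score.

```python
import random

def aggregated_ppr_score(adj, source, event_nodes, alpha=0.15, num_walks=2000, rng=None):
    """Estimate the personalized PageRank mass that `source` places on event nodes."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    hits = 0
    for _ in range(num_walks):
        node = source
        # Continue with probability 1 - alpha; the stopping node is one PPV sample.
        while rng.random() > alpha:
            neighbors = adj[node]
            if not neighbors:  # assumption: a walk simply ends at a dead end
                break
            node = rng.choice(neighbors)
        if node in event_nodes:
            hits += 1
    return hits / num_walks
```

A vertex inside a dense cluster of event vertices should score higher than a vertex far from it, which is what makes the aggregate usable for picking promising vertices.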

I3.1: Noise-Aware Mining: Graph Iceberg
• Our model: aggregated personalized PageRank + sampling
  – 10-50 times faster
• A novel graph mining framework
  – Finds anomaly regions in large heterogeneous information networks
  – Noise-aware mining: the score is an aggregate measure, which can easily overcome noise
• The first of its kind in network science
Nan Li, et al., Towards Iceberg Analysis in Graph OLAP, in preparation for VLDB Journal

I3.1: Noise-Aware Mining: Structural Correlation
Question: Is the distribution of events (blue nodes) influenced by the network links or not? If it is, to what degree?
• A novel metric, Decayed Hitting Time, is proposed to assess and rank structural correlations
• SIGMOD reviewer: “Interesting problem that I haven’t seen before”
• The first of its kind defined for networks
• Sampling algorithm: times faster
• An aggregate measure: noise-resistant
(UCSB) Z. Guan, J. Wu, Z. Yun, A. Singh, X. Yan, Assessing and Ranking Structural Correlations in Graphs, Proc. Int. Conf. on Management of Data (SIGMOD'11)

What is structural correlation?
• Real-world networks contain not only nodes and edges but also events (attributes)
  – Information network: events, documents, etc.
  – Social network: blog posts, rumors, opinions, online shopping, etc.
  – Virus/malware infections
• Virus propagation through computer networks, network, or Facebook: which one is the main channel for a specific virus/malware?
• Some events are correlated with network links, while others occur randomly

Correlation Metric in Information Networks
Question: Is the distribution of events (blue nodes) influenced by the network links or not? If it is, to what degree?
Why measure such correlation?
• Helps understand the distribution of events in networks
• Helps detect viral influence in the underlying network
  – Correlation depends on link type, event type, and time

How to measure?
• If events are correlated with the links, black nodes tend to stick together
• A naïve approach: look only at the immediate neighborhood
• General idea: compute the aggregated proximity among black nodes, which will be noise-resistant

Measure definition
• The measure
  – $V_q$: the set of nodes having event $q$; $s(*)$ can be any graph proximity measure
• We choose hitting time since it treats the target node set as a whole (compared to personalized PageRank, shortest distance, etc.)
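
The measure's formula is not preserved in this transcript. One form consistent with the stated idea of aggregating proximity among the event nodes, given here only as an assumption, averages the proximity of each event node to the remaining ones (the symbol $\rho$ matches its later use in the slides; the averaging itself is the assumed part):

```latex
\rho(V_q) = \frac{1}{|V_q|} \sum_{v \in V_q} s\bigl(v,\; V_q \setminus \{v\}\bigr)
```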

Hitting time
• The expected number of steps to reach a target node via a random walk: $h(v_i, B) = \sum_{t=1}^{\infty} t \cdot \Pr(T_B = t \mid x_0 = v_i)$
  – $B$: target node set; $\Pr(T_B = t \mid x_0 = v_i)$: the probability that a walk starting from $v_i$ first reaches $B$ after $t$ steps
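
As a concrete illustration, hitting times on a small graph can be computed from the recurrence h(v) = 1 + average of h over the neighbors of v, with h = 0 on target nodes. The sketch below is a plain fixed-point iteration written for this note; the function name and the iteration count are hypothetical choices, and it assumes every non-target node has at least one neighbor and can reach a target.

```python
def hitting_times(adj, targets, iterations=500):
    """Approximate expected steps for a simple random walk to reach `targets`."""
    h = {v: 0.0 for v in adj}
    for _ in range(iterations):
        # Jacobi-style update of h(v) = 1 + mean of h over neighbors; targets stay at 0.
        h = {
            v: 0.0 if v in targets
            else 1.0 + sum(h[u] for u in adj[v]) / len(adj[v])
            for v in adj
        }
    return h
```

On the path graph 0 - 1 - 2 with target {2}, the recurrence gives h(1) = 3 and h(0) = 4, which the iteration reproduces.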

Decayed Hitting Time (DHT)
• Hitting time can be infinite
• To calculate proximity better and faster, we propose using Decayed Hitting Time
  – Maps $[1, \infty)$ to $[0, 1]$; a high value means high proximity
  – Emphasizes the importance of short paths and reduces the impact of long paths
  – Facilitates approximation of DHT
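
The DHT formula itself is missing from this transcript. A definition consistent with the three properties listed above, assuming an exponential decay weight $e^{-(t-1)}$ (an assumption made here for illustration), would replace the factor $t$ in the hitting-time sum with that weight:

```latex
\tilde{h}(v_i, B) = \sum_{t=1}^{\infty} e^{-(t-1)} \, \Pr(T_B = t \mid x_0 = v_i)
```

Under this form a walk that hits at the first step contributes 1, later hits contribute exponentially less, and a walk that never hits contributes 0, so the value always lies in $[0, 1]$.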

DHT sampling approximation
• Perform $c$ simulated random walks from $v_i$
• Two strategies:
  – In each random walk, stop when we hit a target node. This gives an estimate of the DHT
    • May never stop
    • In a large graph, can be time-consuming
  – In each random walk, stop when we hit a target node or when the maximum number of steps (denoted by $s$) is reached. This gives an approximation to the DHT estimate

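Bounds for Sampling Approximation
• Suppose that, of the $c$ random walks, some hit a target and the rest reach $s$ steps without hitting one
• Each walk in the latter group could hit a target only after more than $s$ steps, so its contribution to the DHT estimate is upper bounded by the decay weight of step $s + 1$ and lower bounded by 0
• Bounds for the DHT estimate follow: the lower bound counts only the walks that hit a target, and the upper bound adds the maximum possible contribution of every truncated walk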
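
The truncated sampling strategy and its bounds can be sketched as follows. The decay weight `exp(-(t - 1))` is an assumption (the transcript does not preserve the DHT formula), and the function name and parameters are hypothetical. Walks that use all `max_steps` without hitting contribute 0 to the lower bound and at most `exp(-max_steps)` each to the upper bound.

```python
import math
import random

def sample_dht(adj, start, targets, num_walks=1000, max_steps=10, rng=None):
    """Return (lower, upper) bounds on the sampled DHT from `start` to `targets`."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    total = 0.0    # summed decay weights of walks that hit a target
    truncated = 0  # walks that reached max_steps without hitting
    for _ in range(num_walks):
        node = start
        for t in range(1, max_steps + 1):
            node = rng.choice(adj[node])
            if node in targets:
                total += math.exp(-(t - 1))  # assumed decay weight for a hit at step t
                break
        else:
            truncated += 1
    lower = total / num_walks
    # A truncated walk would hit at step max_steps + 1 or later, so weight <= exp(-max_steps).
    upper = (total + truncated * math.exp(-max_steps)) / num_walks
    return lower, upper
```

The gap between the two bounds is at most `exp(-max_steps)`, so even a small step limit gives a tight interval under this decay.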

From measure to significance
• Consider a randomly selected set of $m$ nodes
  – As $m$ increases, $m$ randomly selected nodes tend to be close to one another (in fact, the expected measure of a random set can be proved to increase monotonically with $m$)
  – Relying on the measure $\rho$ alone is not enough; we should assess how far $\rho$ deviates from the random case
• An approximation method for significance

Estimating the mean of ρ for random node sets
• Sampling: randomly sample $c$ node sets of size $m$ and estimate their ρ values (also by sampling). Then take the sample mean as the estimate
• An approximation method by geometric distribution
  – When generating a random set, each node has probability $m/n$ of being chosen
  – Relaxation: each node is chosen independently
  – Starting from a node, the probability that the random walk hits a target node after $t$ steps is $(1-p)^{t-1} p$, where $p = m/n$. The expected DHT then follows from the definition of DHT
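
A worked version of the geometric approximation, under two stated assumptions: every step of the walk hits a target independently with probability $p = m/n$ (the relaxation above), and the DHT decay weight is $e^{-(t-1)}$ (not preserved in the transcript). The hitting step is then geometric, and the expected DHT is a geometric series:

```latex
\mathbb{E}\bigl[\tilde{h}\bigr]
\approx \sum_{t=1}^{\infty} e^{-(t-1)} (1-p)^{t-1} p
= \frac{p}{1 - (1-p)/e}
= \frac{p\,e}{e - 1 + p},
\qquad p = \frac{m}{n}
```

The result increases with $p$, which agrees with the observation that larger random node sets look closer together.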

Estimating the variance of ρ for random node sets
• Also use sampling
  – Sample node sets of size $m$ and estimate their ρ values, then compute the sample variance
  – Since each DHT term in the definition of ρ is assumed to be independent, the variance of ρ can be derived from the variance of a single DHT term
  – Thus, we sample node pairs, estimate their DHTs, and compute the sample variance
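
With a mean and variance for the random case, one common way to express the deviation is a z-score. The slides do not state the exact statistic, so the sketch below is an assumption; `significance_z` and its arguments are hypothetical names.

```python
import statistics

def significance_z(observed_rho, random_rhos):
    """Z-score of an observed measure against measures of random node sets."""
    mu = statistics.mean(random_rhos)      # sample mean of the random case
    sigma = statistics.stdev(random_rhos)  # sample standard deviation; needs >= 2 samples
    # Assumes sigma > 0, i.e. the random sets do not all give the same value.
    return (observed_rho - mu) / sigma
```

A large positive value would indicate that the event nodes sit much closer together than random node sets of the same size.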

Experiments - Datasets
• DBLP
  – Co-author network
  – Events: keywords in paper titles
  – 815,940 nodes, 2,857,960 edges, and 171,614 events
• TaoBao
  – Online shopping data, friend network
  – Events: products
  – 794,001 nodes, 1,370,284 edges, 100 typical products
• Twitter
  – 40 million nodes and 1.4 billion edges

Experiments - Efficiency

Experiments - Effectiveness (TaoBao)

Experiments - Correlation Evolution (TaoBao)

Collaborations
• Collaborations with researchers in other networks
  – (E1.2) Work with Prithwish Basu (BBN) on network design
  – (I2.1) Work with Arun Iyengar and Mudhakar Srivatsa (IBM), who have done much work on DTN and storage, on building a connection between information network processing on DTN and on clusters. Shengqi Yang will work on it at IBM this summer
  – (T2.3) Work with Vikas Kawadia (BBN) on using graph query processing for distributed trust computing. Ziyu Guan is collaborating with Vikas
  – (S1.1) Work with Zhen Wen (IBM) on the social network application of graph density indexing
  – (E1.1) Work with Jie Bao (RPI) on RDF queries using neighborhood-based graph search
  – (I3.1, I3.2) Work with Jiawei Han (UIUC) on graph mining
  – Work with Sachi Desai (Army) on graph query language/system

Research Papers
• A. Khan, N. Li, Z. Guan, X. Yan, S. Chakraborty, and S. Tao, Neighborhood Based Fast Graph Search in Large Networks, Proc. Int. Conf. on Management of Data (SIGMOD'11)
• Z. Guan, J. Wu, Z. Yun, A. Singh, X. Yan, Assessing and Ranking Structural Correlations in Graphs, Proc. Int. Conf. on Management of Data (SIGMOD'11)
• Qiang Qu, Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu, and Hongyan Li, Efficient Topological OLAP on Information Networks, DASFAA'11
• Yizhou Sun et al., PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks, submitted to VLDB 2011
• Nan Li, et al., Towards Iceberg Analysis in Graph OLAP, to be submitted to VLDB Journal

Research Papers
• Gengxin Miao, Ziyu Guan, Louise Moser, Xifeng Yan, Shu Tao, Nikos Anerousis, Latent Association Analysis of Document Pairs, submitted to SIGKDD 2011
• Ziyu Guan et al., Diffusion through Co-occurrence Relationships for Expert Search on the Web, submitted to SIGIR 2011
• Shengqi Yang, Bo Zong, Arijit Khan, Ben Zhao, Xifeng Yan, Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011
• Nan Li, Arijit Khan, Xifeng Yan, and Zhen Wen, Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal
• C. C. Aggarwal, A. Khan, X. Yan, A Probabilistic Index for Massive and Dynamic Graph Streams, Research Report, to be submitted to VLDB Journal

Next Six Months and Path Ahead to 2012
• Continue research
  – Composite Network Modeling and Design
  – Large-Scale Information Network Processing
  – Information Network in DTN
  – Information Network Mining and Measuring with Noise and Dynamics
    • Structural Correlation in Dynamic Situations
    • Mining Graph Patterns in a Noisy Environment
    • Node Mining and Inference for Multiple Information Networks (QoI)
  – Information Network Modeling with Text
  – Graph Query Language
  – Information Network Query Engine

Brief Summary of My Team’s Work in Other Tasks

I2.2: Graph Partition for Distributed Graph Computing
• Are typical techniques efficient for graph queries?
• Graph partitioning and distribution techniques (e.g., Pregel)
  Limitations:
  – Unbalanced workload due to skewed, non-uniformly distributed graph queries
  – Communication overhead due to inter-machine (cross-partition) communication
• Goal
  – Model-based graph partitioning techniques
  – First-of-its-kind public distributed graph computing platform for information, social, and communication networks
Shengqi Yang, et al., Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011

I2.1: Adapt Sedge to DTN
• Master
  – Vertex->Partition map + network contact graph + route table (opportunistic path)
• Worker
  – Message queue (cache)
• Superstep = time slot
  – e.g., 1 min, 1 hour, 1 day, etc.
[Figure: partitions P1 to P6 linked by a contact graph and cluster connections]

I2.2 gDensity: Model-Based Indexing
• Problem definition (labeled proximity search)
  – Label-based graph proximity search seeks the top-k vertex subsets with the smallest diameters for a given query of distinct labels. Each subset must cover all the labels specified in the query
• 10 to 300 times faster
• Uses a probabilistic model to build the index
Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal
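
To make the problem definition concrete, here is a brute-force sketch that enumerates one vertex per query label and ranks the resulting subsets by diameter (the largest pairwise shortest-path distance). It only illustrates the problem statement on small graphs and is not the gDensity index; all names are hypothetical.

```python
from collections import deque
from itertools import combinations, product

def bfs_distances(adj, source):
    """Hop distances from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def top_k_proximity(adj, labels, query, k=1):
    """Return the k (diameter, vertices) pairs that cover every query label."""
    # Candidate vertices for each query label.
    candidates = [[v for v in adj if label in labels[v]] for label in query]
    dist = {v: bfs_distances(adj, v) for group in candidates for v in group}
    results = set()
    for choice in product(*candidates):
        subset = frozenset(choice)
        # Diameter of the subset; unreachable pairs count as infinitely far apart.
        diameter = max(
            (dist[a].get(b, float("inf")) for a, b in combinations(subset, 2)),
            default=0,
        )
        results.add((diameter, tuple(sorted(subset))))
    return sorted(results)[:k]
```

The enumeration grows with the product of the candidate-list sizes, which is the cost an index is meant to avoid.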

I2.2: Graph Search: the Model-Based Approach
• Align two networks (LinkedIn, Facebook)
• Ideas
  – Use an information propagation model to propagate labels in information networks
  – Convert vertices to vectors
  – Align sets of vectors
• Query speed: 0.1 sec for WebGraph (10M vertices, 213M edges)
A. Khan et al., Neighborhood Based Fast Graph Search in Large Networks, SIGMOD’11
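
One way to picture "propagate labels, then convert vertices to vectors" is the sketch below: each vertex receives a vector whose entry for a label sums a per-hop decay over nearby vertices carrying that label. The decay factor `alpha`, the hop limit, and the function name are assumptions for illustration and may differ from the propagation model used in the cited paper.

```python
from collections import deque

def neighborhood_vector(adj, labels, source, hops=2, alpha=0.5):
    """Map `source` to {label: strength}, decaying by `alpha` per hop."""
    dist = {source: 0}
    queue = deque([source])
    vector = {}
    while queue:
        node = queue.popleft()
        if dist[node] > 0:  # the source's own labels are not propagated to itself
            for label in labels[node]:
                vector[label] = vector.get(label, 0.0) + alpha ** dist[node]
        if dist[node] < hops:  # expand only within the hop limit
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
    return vector
```

Two vertices from different networks can then be compared by a distance between their vectors, which turns structural matching into vector alignment.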

I3.2: Progressive Network Analysis for Expert Search
• Goal: find and rank people who have the expertise described by a user query
• Web pages are noisier and contain more spam than an enterprise corpus, so both relevance and reputation should be considered
• Use a heterogeneous hypergraph to model the co-occurrence relationships among people and words, and devise a heat diffusion model on the hypergraph
• Applied to 0.5B web pages
• Accuracy: 50%-200% improvement over the leading language model methods; significantly overcomes noise in the Web
Ziyu Guan, et al., “Diffusion through Co-occurrence Relationships for Expert Search on the Web”, SIGIR’11 (sub)

I3.2: Latent Association Analysis of Document Pairs
• Latent Association Analysis (LAA) mines the topics of two document sets simultaneously, taking the bipartite network between the two document sets into consideration
• One of the first attempts to analyze the topic structures of two connected document sets, aiming to infer their mapping network model
• LAA significantly outperforms existing algorithms, with a 70% accuracy improvement
[Figure: topic simplex for corpus 1, topic simplex for corpus 2, correlation factor, document pairs]
Gengxin Miao, et al., “Latent Association Analysis of Document Pairs”, KDD’11 (sub)

E1.2: Collaborative Network Modeling and Inference
Questions:
1. How to model it?
2. How does information flow among different agents?
3. How do agents interact with each other?
4. How to measure the quality of the flow?
5. Is there any mis-interaction among these agents?
6. Can we identify the roles of the agents?
7. Can we identify the relationships?
8. Can we identify the weak components?