Paper Presentation Social influence based clustering of heterogeneous information networks Qiwei Bao & Siqi Huang
What’s it all about? There is growing interest in clustering a social network of people based on both their social relationships and their participation in information networks. This paper uses the concept of social influence to improve clustering quality. Social influence studies how the impact of people’s activities and opinions propagates to other members of a social network via direct and indirect social connections.
Keywords: Graph Clustering, Heterogeneous Network, Kernels, Social Influence
Today’s Presentation Part One: Definitions, Concepts, Kernels, Similarity Measurement. Part Two: Clustering Algorithm (SI-Clustering), Parameter-based Optimization, Experiments, Conclusions.
Problem Statement TWO KINDS OF INFLUENCE Model activities, events, and experiences as information networks in addition to the social relationships of people. Social influence can propagate through these networks: 1. Self-influence: people influence one another based solely on the social network; 2. Co-influence: people influence one another through individuals’ participation in some activity/event networks.
Problem Statement THREE TYPES OF GRAPHS/NETWORKS Social Collaboration Network (Social Graph, SG) SG = (U, E) U: set of vertices, the members of the social network (e.g., authors, customers). E: set of edges denoting the collaborative relationships between the members. N_SG: the size of U.
Problem Statement THREE TYPES OF GRAPHS/NETWORKS Associated Activity Network (Activity Graph, AG_i) AG_i = (V_i, S_i) V_i: activity vertices in the i-th associated activity network AG_i. S_i: weighted edges representing the similarity between two activity vertices. N_AG_i: the size of each activity vertex set V_i.
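As a rough illustration of the two definitions above, the sketch below models SG = (U, E) and AG_i = (V_i, S_i) as plain Python containers. The class names, field names, placeholder author labels, and the 0.7 similarity weight are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class SocialGraph:
    """SG = (U, E): members and their collaborative relationships."""
    members: set   # U
    edges: set     # E, each edge stored as a frozenset of two members

    @property
    def size(self) -> int:
        """N_SG, the number of member vertices."""
        return len(self.members)


@dataclass
class ActivityGraph:
    """AG_i = (V_i, S_i): activity vertices and weighted similarity edges."""
    activities: set    # V_i
    similarity: dict   # S_i, maps a frozenset of two activities to a weight

    @property
    def size(self) -> int:
        """N_AG_i, the number of activity vertices."""
        return len(self.activities)


# Hypothetical toy data: three authors and two conferences.
sg = SocialGraph(
    members={"author_a", "author_b", "author_c"},
    edges={frozenset({"author_a", "author_b"}),
           frozenset({"author_b", "author_c"})},
)
conference_graph = ActivityGraph(
    activities={"KDD", "ICDM"},
    similarity={frozenset({"KDD", "ICDM"}): 0.7},  # made-up weight
)
print(sg.size, conference_graph.size)
```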
Problem Statement THREE TYPES OF GRAPHS/NETWORKS Influence Network (Influence Graph, IG_i)
Problem Statement HETEROGENEOUS NETWORK When both the self-influence and the co-influence networks are considered, the network as a whole is heterogeneous.
Problem Statement Social Influence-based graph Clustering (SI-Cluster) Given: a social graph, multiple activity graphs, and the corresponding influence graphs. Problem: partition the member vertices U into K disjoint clusters U_i. A desired clustering result should achieve a good balance: (1) vertices within one cluster should have similar collaborative patterns among themselves and similar interaction patterns with activity networks; (2) vertices in different clusters should have dissimilar collaborative patterns and dissimilar interaction patterns with activities.
Problem Statement Social Influence-based graph Clustering (SI-Cluster) The clustering algorithm should be fast and should scale with the number of influence graphs and the size of the activity graphs.
Dataset DBLP Dataset: two types of entities (authors and conferences) and three types of links (co-authorship, author-conference, and conference similarity).
Influence-based Similarity Step 1: Heat Diffusion on Social Graph
Influence-based Similarity Step 2: Compute Self-influence Similarity
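Steps 1 and 2 can be sketched on a toy graph. The code below assumes the textbook graph heat kernel exp(alpha * t * H) with H = A - D (adjacency matrix minus degree matrix) and reads the similarity straight off the kernel entries. The paper's exact diffusion coefficients and normalization may differ, and the function names, the values of alpha and t, and the four-vertex path graph are illustrative choices.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def heat_kernel(adj, alpha=0.5, t=1.0, terms=30):
    """Approximate exp(alpha * t * H) with a truncated Taylor series.

    H = A - D is an assumed, textbook choice of diffusion matrix.
    """
    n = len(adj)
    scaled = [[alpha * t * (adj[i][j] - (sum(adj[i]) if i == j else 0.0))
               for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    kernel = [row[:] for row in identity]
    term = identity
    for k in range(1, terms + 1):
        # After this line, term equals (alpha * t * H)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return kernel


# Path graph 0 - 1 - 2 - 3 (hypothetical toy social graph).
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
kernel = heat_kernel(adjacency)

# Heat that starts entirely on vertex 0 ends up as column 0 of the kernel.
heat_from_0 = [kernel[v][0] for v in range(4)]
print([round(h, 3) for h in heat_from_0])
```

In this sketch the heat reaching a vertex falls off with its distance from the source, which is the intuition behind treating kernel entries as a self-influence similarity.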
Influence-based Similarity Co-influence Kernel on Influence Graph Non-propagating heat diffusion kernel H_i for each influence graph IG_i (one hop)
Influence-based Similarity Co-influence Kernel on Influence Graph
Influence-based Similarity Step 3: Compute Propagating Co-influence Kernel on Influence Graph (Figure: Philip S. Yu and his co-authors with more than 45 co-publications)
Influence-based Similarity Step 4: Partition Activities into Clusters (Figure: Philip S. Yu and his co-authors with more than 45 co-publications)
Influence-based Similarity Propagate Heat Distribution Initialize the heat distribution f_ij(0) for each cluster c_ij in each influence graph IG_i
Influence-based Similarity Step 5: Compute Influence Score Based on Co-influence Model
Influence-based Similarity Step 6: Compute Co-influence Similarity (Figure: Philip S. Yu and his co-authors with more than 45 co-publications)
Influence-based Similarity Step 6: Compute Co-influence Similarity Co-influence similarity matrix W_i for each influence graph IG_i Step 7: Compute Unified Co-influence based Similarity
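A minimal sketch of Step 7, assuming the unified similarity is a weighted sum of the self-influence matrix W_0 and the co-influence matrices W_1 ... W_N, with one non-negative weight per matrix. The function name, the toy matrices, and the weight values are illustrative and are not taken from the paper.

```python
def unified_similarity(matrices, weights):
    """Combine similarity matrices W_0..W_N into one weighted-sum matrix."""
    if len(matrices) != len(weights):
        raise ValueError("need exactly one weight per similarity matrix")
    if any(w < 0 for w in weights):
        raise ValueError("weights are assumed to be non-negative")
    n = len(matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices))
             for j in range(n)] for i in range(n)]


# Toy example: one self-influence matrix and one co-influence matrix.
w0 = [[1.0, 0.5], [0.5, 1.0]]   # hypothetical self-influence similarity
w1 = [[1.0, 0.1], [0.1, 1.0]]   # hypothetical co-influence similarity
unified = unified_similarity([w0, w1], weights=[1.5, 0.5])
print(unified)
```

Raising one weight relative to the others makes the corresponding network count for more in the final similarity, which is what the later weight-adjustment step exploits.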
SI-Clustering Algorithm What is it? Initialization: take the most centrally located point in each cluster as its centroid and assign the rest of the points to their closest centroids. Iteration: calculate the clustering objective and update the N + 1 weights, repeating until the clustering converges.
SI-Clustering Algorithm Cont. Initialization
SI-Clustering Algorithm Cont. Vertex Assignment and Centroid Update Update each centroid with the most centrally located vertex in its cluster
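The assignment and centroid-update steps can be sketched as a K-medoids-style loop over a precomputed similarity matrix. This is an illustrative simplification under that assumption: it leaves out the weight-adjustment step, and the function names and the toy matrix are invented for the example.

```python
def assign_vertices(sim, centroids):
    """Assign every vertex to the cluster of its most similar centroid."""
    return [max(range(len(centroids)), key=lambda c: sim[v][centroids[c]])
            for v in range(len(sim))]


def update_centroids(sim, labels, centroids):
    """Pick, per cluster, the member most similar to the cluster's members."""
    updated = []
    for c, old in enumerate(centroids):
        members = [v for v, label in enumerate(labels) if label == c]
        if not members:          # keep the old centroid for an empty cluster
            updated.append(old)
            continue
        updated.append(
            max(members, key=lambda v: sum(sim[v][u] for u in members)))
    return updated


def cluster(sim, centroids, max_iter=100):
    """Alternate assignment and centroid update until centroids stop moving."""
    for _ in range(max_iter):
        labels = assign_vertices(sim, centroids)
        new_centroids = update_centroids(sim, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical similarity matrix with two obvious groups: {0,1,2} and {3,4,5}.
similarity = [
    [1.0, 0.9, 0.6, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1, 0.1],
    [0.6, 0.9, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9, 0.6],
    [0.1, 0.1, 0.1, 0.9, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.6, 0.9, 1.0],
]
labels, centroids = cluster(similarity, centroids=[0, 3])
print(labels, centroids)
```

Starting from centroids 0 and 3, the loop moves each centroid to the middle vertex of its group (1 and 4), since those have the highest total similarity to their cluster members.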
SI-Clustering Algorithm Cont. Clustering Objective Function
SI-Clustering Algorithm Cont. Clustering Objective Function Cont.
SI-Clustering Algorithm Cont. Clustering Objective Function Cont. Simplified into three sub-problems: (1) cluster assignment, (2) centroid update, (3) weight adjustment. Sub-problems (1) and (2) are common to all partitioning clustering algorithms.
SI-Clustering Algorithm Cont. Parameter-based Optimization
SI-Clustering Algorithm Cont. Parameter-based Optimization Cont.
SI-Clustering Algorithm Cont. Adaptive Weight Adjustment & Clustering Algorithm Solving this nonlinear parametric programming problem (NPPP) takes two parts: (1) find a parameter β with F(β) = 0, which makes the NPPP equivalent to the nonlinear fractional programming problem (NFPP); (2) given that β, solve a polynomial programming problem in the original variables.
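The two-part procedure resembles the classic parametric (Dinkelbach-style) treatment of a fractional objective f(x) / g(x): search for the β at which F(β) = max over x of [f(x) - β * g(x)] reaches 0, solving a non-fractional subproblem for each trial β. The sketch below applies that generic idea to a made-up one-variable ratio over a finite grid. The functions f and g, the grid, and the helper name are assumptions, and the paper's actual objective over the N + 1 weights is not reproduced here.

```python
def solve_fractional(f, g, candidates, tol=1e-12, max_iter=100):
    """Maximize f(x) / g(x) over candidates, assuming g(x) > 0 everywhere."""
    x = candidates[0]
    beta = f(x) / g(x)
    for _ in range(max_iter):
        # Subproblem for fixed beta: maximize the non-fractional f - beta * g.
        x = max(candidates, key=lambda c: f(c) - beta * g(c))
        gap = f(x) - beta * g(x)      # this is F(beta)
        if gap <= tol:                # F(beta) = 0: beta is the optimal ratio
            break
        beta = f(x) / g(x)            # otherwise raise beta and repeat
    return beta, x


def f(x):
    """Hypothetical numerator."""
    return 1.0 + 4.0 * x - 3.0 * x * x


def g(x):
    """Hypothetical denominator, positive on [0, 1]."""
    return 1.0 + x


grid = [i / 1000 for i in range(1001)]
beta, best_x = solve_fractional(f, g, grid)
print(round(beta, 4), best_x)
```

Each pass only ever optimizes f - β * g, which has no division in it; that is the sense in which the parametric form is easier to handle than the original ratio.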
Evaluation Datasets Amazon product co-purchasing network: 20,000 products; activity graphs: product category graph and customer review graph. DBLP bibliography data, full version: 964,166 authors; activity graphs: conference and keyword. DBLP bibliography data, subset: 100,000 authors; activity graphs: conference and keyword.
Evaluation Cont. Baseline Methods Algorithms to be compared: BAGC, SA-Cluster, Inc-Cluster, W-Cluster. Measures: density, entropy, Davies-Bouldin Index (DBI).
Evaluation Cont. Cluster quality evaluation Dataset: 20,000 Amazon products. Number of clusters: K = 40, 60, 80, 100.
Evaluation Cont. Cluster quality evaluation Cont. Measure: DBI. Dataset: DBLP with 100,000 authors. Number of clusters: K = 400, 600, 800,
Evaluation Cont. Cluster quality evaluation Cont. Measure: DBI. Dataset: DBLP with 964,166 authors. Number of clusters: K = 4000, 6000, 8000,
Evaluation Cont. Cluster efficiency evaluation
Evaluation Cont. Cluster convergence Observation: both the social weight and the keyword weight increase with more iterations, while the conference weight decreases. Explanation: people who have many publications in the same conferences may still have different research topics, whereas people who have many papers with the same keywords usually share research topics and are therefore more likely to collaborate as co-authors.
Evaluation Cont. Case Study
Conclusion (Diagram labels: unified influence-based model; Webs; links, entities, static activities, dynamic activities; SI-Clustering: compute vertex similarity, update centroid, evaluation; a sophisticated nonlinear fractional programming problem → a straightforward nonlinear parametric programming problem)
Conclusion Cont. Integrated different types of links, entities, static attributes, and dynamic activities from different networks into a unifying influence-based model. Proposed an iterative learning algorithm. Transformed a sophisticated nonlinear fractional programming problem of multiple weights into a straightforward nonlinear parametric programming problem of a single variable to speed up the clustering process.
Thanks! Q&A?